Preface



PAKDD 2001, Hong Kong, 16-18 April, was organized by the E-Business Technology
Institute of The University of Hong Kong in cooperation with ACM Hong
Kong, IEEE Hong Kong Chapter, and The Hong Kong Web Society. It was the
Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining and the
successor of earlier PAKDD conferences held in Singapore (1997), Melbourne,
Australia (1998), Beijing, China (1999), and Kyoto, Japan (2000).

PAKDD 2001 brought together participants from universities, industry, and
government to present, discuss, and address both current issues and novel approaches
in the practice, deployment, theory, and methodology of Knowledge Discovery
and Data Mining. The conference provides an international forum for the
sharing of original research results and practical development experiences among
researchers and application developers from the many KDD-related areas including
machine learning, databases, statistics, internet, e-commerce, knowledge
acquisition, data visualization, knowledge-based systems, soft computing, and
high performance computing.

The PAKDD 2001 conference included technical sessions organized around
important subtopics such as: Web Mining; Text Mining; Applications and Tools;
Interestingness; Feature Selection; Sequence Mining; Spatial and Temporal Mining;
Concept Hierarchies; Association Mining; Classification and Rule Induction;
Clustering; and Advanced Topics and New Methods.

Following careful review of the 152 submissions by members of the international
program committee, 38 regular papers and 22 short papers were selected
for presentation at the conference and for publication in this volume.

The conference program also included invited keynote presentations from 
three international researchers and developers in data mining: H. V. Jagadish 
of the University of Michigan, Ronny Kohavi of Blue Martini, and Hongjun 
Lu of the University of Science and Technology, Hong Kong. Abstracts of their 
presentations are included in this volume. 

The conference presented six tutorials from experts in their respective disciplines:
An Introduction to MARS (Dan Steinberg); Static and Dynamic Data
Mining Using Advanced Machine Learning Methods (Ryszard S. Michalski);
Sequential Pattern Mining: From Shopping History Analysis to Weblog Mining
and DNA Mining (Jiawei Han and Jian Pei); Recent Advances in Data Mining
Algorithms for Large Databases (Rajeev Rastogi and Kyuseok Shim); Web Mining
for E-Commerce (Jaideep Srivastava); and From Evolving Single Neural
Networks to Evolving Ensembles (Xin Yao).

Associated workshops included: Spatial and Temporal Data; Statistical Techniques
in Data Mining; and Data Mining and Electronic Business.







A conference such as this can only succeed as a team effort. We would like 
to thank the program committee members and reviewers for their efforts and 
the PAKDD steering committee members for their invaluable input and advice. 
Our sincere gratitude goes to all of the authors who submitted papers. We are 
grateful to our sponsors for their generous support. Special thanks go to Ms 
Winnie Yau, E-Business Technology Institute, The University of Hong Kong, 
for her considerable efforts, seamlessly keeping everything running smoothly and 
coordinating the many streams of the conference organization. 

On behalf of the organizing and program committees of PAKDD 2001, we
trust you found the conference a fruitful experience and hope you had an enjoyable
stay in Hong Kong.
Incompleteness in Data Mining



Hosagrahar Visvesvaraya Jagadish* 



University of Michigan, 
Ann Arbor 
jag@umich.edu



Abstract. Database technology, as well as the bulk of data mining technology,
is founded upon logic, with absolute notions of truth and falsehood,
at least with respect to the data set. Patterns are discovered exhaustively,
with carefully engineered algorithms devised to determine all
patterns in a data set that belong to a certain class. For large data sets,
many such data mining techniques are extremely expensive, leading to
considerable research towards solving these problems more cheaply.

We argue that the central goal of data mining is to find SOME interesting 
patterns, and not necessarily ALL of them. As such, techniques that 
can find most of the answers cheaply are clearly more valuable than 
computationally much more expensive techniques that can guarantee 
completeness. In fact, it is probably the case that patterns that can be 
found cheaply are indeed the most important ones. 

Furthermore, knowledge discovery can be most effective with the
human analyst heavily involved in the endeavor. To engage a human analyst,
it is important that data mining techniques be interactive, hopefully
delivering (close to) real time responses and feedback. Clearly then,
extreme accuracy and completeness (i.e., finding all patterns satisfying
some specified criteria) would almost always be a luxury. Instead, incompleteness
(i.e., finding only some patterns) and approximation would be
essential.

We exemplify this discussion through the notion of fascicles. Often many 
records in a database share similar values for several attributes. If one 
is able to identify and group together records that share similar values 
for some - even if not all - attributes, one can both obtain a more 
parsimonious representation of the data, and gain useful insight into the 
data from a mining perspective. Such groupings are called fascicles. We 
explore the relationship of fascicle-finding to association rule mining, and 
experimentally demonstrate the benefit of incomplete but inexpensive 
algorithms. We also present analytical results demonstrating both the 
limits and the benefits of such incomplete algorithms. 



* Supported in part by NSF grant IIS-0002356. Portions of the work joint with Cinda 
Heeren, Raymond Ng, and Lenny Pitt 
Mining E-Commerce Data:
The Good, the Bad, and the Ugly 

Ronny Kohavi 

Director of Data Mining, 

Blue Martini Software 
ronnyk@bluemartini.com



Abstract. Electronic commerce provides all the right ingredients for 
successful data mining (the Good). Web logs, however, are at a very 
low granularity level, and attempts to mine e-commerce data using only 
web logs often result in little interesting insight (the Bad). Getting the 
data into minable formats requires significant pre-processing and data 
transformations (the Ugly). In the ideal e-commerce architecture, high 
level events are logged, transformations are automated, and data mining 
results can easily be understood by business people who can take action 
quickly and efficiently. Lessons, stories, and challenges based on mining 
real data at Blue Martini Software will be presented. 
Seamless Integration of Data Mining with
DBMS and Applications 



Hongjun Lu 

The Hong Kong University of Science and Technology, 

Hong Kong, China 
luhj@cs.ust.hk

Abstract. Data mining has been widely recognized as a powerful tool
for exploring added value from data accumulated in the daily operations
of an organization. A large number of data mining algorithms have been
developed during the past decade. Those algorithms can be roughly divided
into two groups. The first group of techniques, such as classification,
clustering, prediction and deviation analysis, has been studied for a long
time in machine learning, statistics, and other fields. The second group of
techniques, such as association rule mining, mining in spatial-temporal
databases and mining from the Web, addresses problems related to large
amounts of data. Most classical algorithms in the first group assume that
the data to be mined is somehow available in memory. Although initial effort
in data mining has concentrated on making those algorithms scalable
with respect to large volumes of data, most of those scalable algorithms,
even those developed by database researchers, are still stand-alone. It is often
assumed that data is available in the desired form, without considering
the fact that most organizations store their data in databases managed
by database management systems (DBMS). As such, most data mining
algorithms can only be loosely coupled with data infrastructures in
organizations and are difficult to infuse into existing mission-critical
applications. Seamlessly integrating data mining techniques with database
applications and database management systems remains an open problem.

In this paper, we propose to tackle the problem of seamless integration
of data mining with DBMS and applications from three directions. First,
with the recent development of database technology, most database management
systems have extended their functionality in data analysis. Such
capability should be fully explored to develop DBMS-aware data mining
algorithms. Ideally, data mining algorithms can be fully implemented
using DBMS-supported functions so that they become database applications
themselves. Second, major difficulties in integrating data mining
with applications are algorithm selection and parameter setting. Reducing
or eliminating mining parameters as much as possible and developing
automatic or semi-automatic mining algorithm selection techniques
will greatly increase the application friendliness of data mining systems.
Lastly, standardizing the interface among databases, data mining algorithms,
and applications can also facilitate the integration to a certain
extent.
Applying Pattern Mining to Web
Information Extraction 



Chia-Hui Chang, Shao-Chen Lui, and Yen-Chin Wu 



Dept. of Computer Science and Information Engineering
National Central University, Chung-Li, 320, Taiwan 
Abstract. Information extraction (IE) from semi-structured Web
documents is a critical issue for information integration systems on the
Internet. Previous work in wrapper induction, for example WIEN, Stalker,
and Softmealy, aims to solve this problem by applying machine learning to
automatically generate extractors. However, this approach
still requires human intervention to provide training examples. In this
paper, we propose a novel approach to IE based on repeated pattern mining and
multiple pattern alignment. The discovery of repeated patterns is
realized through a data structure called a PAT tree. In addition, incomplete
patterns are further revised by pattern alignment to cover all
pattern instances. This new approach to IE involves neither human effort
nor content-dependent heuristics. Experimental results show that the
constructed extraction rules can achieve 97 percent extraction over
fourteen popular search engines.

Keywords: information extraction, semi-structured documents, wrapper
generation, pattern discovery, multiple alignment



1 Introduction 

Information extraction (IE) is concerned with extracting from a collection of 
documents the information relevant to a particular extraction task. For instance, 
the meta-search engine MetaCrawler extracts the search results from multiple 
search engines; and the shopping agent Junglee extracts the product information 
from several online stores for comparison. With the growth of the amount of 
online information, the availability of robust, flexible IE has become a stringent 
necessity. 

In contrast to “traditional” information extraction, which is rooted in natural
language processing (NLP) techniques such as linguistic analysis, Internet
information extraction relies on identifying syntactic structures marked by HTML
tags. The difference is due to the nature of the Web: page contents have to
be clear at a glance. Thus, “itemized lists” and “tabular formats” have been the
main presentation styles for Web pages on the Internet. Such presentation styles,
together with the multiple records contained in one document, give rise to
so-called semi-structured Web pages.
The major challenge of IE is the problem of scalability, as the extraction
rules must be tailored for each particular page collection. To automate the
construction of extractors (or wrappers), recent research has identified important
wrapper classes and induction algorithms. For example, Kushmerick et al.
identified a family of wrapper classes and the corresponding induction algorithms,
which generalize from labeled examples to extraction rules [?]. More expressive
wrapper structures have been introduced recently. Softmealy by Hsu and Dung [?]
uses a wrapper induction algorithm to generate extractors that are expressed as
finite-state transducers. Meanwhile, Muslea et al. [?] proposed “STALKER”, which
performs hierarchical information extraction and, with extra scans over the
documents, overcomes Softmealy’s inability to use delimiters that do not
immediately precede and follow the relevant items (see [?] for a complete survey).

In all this work, wrappers are induced from training examples such that
landmarks or delimiters can be generalized from common prefixes or suffixes.
However, labeling these training examples is sometimes time-consuming. Hence,
another track of research is exploring new approaches to fully automate wrapper
construction. For example, Embley et al. describe a heuristic approach to discover
record boundaries in Web documents by identifying candidate separator
tags using five independent heuristics and selecting a consensus separator tag
based on a heuristic combination [?]. However, one serious problem in this
one-tag separator approach arises when the separator tag is also used within a
record, at places other than the boundary.

On the other hand, our work here attempts to eliminate human intervention 
by pattern mining. The motivation is from the observation that useful informa- 
tion in a Web page is often placed in a structure having a particular alignment 
and order. For example, Web pages produced by Web search engines generally 
have regular and repetitive patterns, which usually represent meaningful and 
useful data records. In the next section, we first give an example showing the 
repeated pattern formed by multiple aligned records. 

2 Motivation 

One observation from Web pages is that the information to be extracted is often 
placed in a structure having a particular alignment and forms repetitive patterns. 
For example, query-able or search-able Internet sites such as Web search engines 
often produce Web pages with large itemized match results which are displayed in 
a particular template format. The template can be recognized when the content 
of each match is ignored or replaced by some fixed-length string. Therefore, 
repetitive patterns are formed. For instance, in the example of Figure 1 the
sequence “<LI>Text(_)<I>Text(_)</I>” is repeated four times when all text
strings between two tags, such as “Congo”, “Egypt”, “Belize”, etc., are replaced
by the token class Text(_).

<H1>Country Code</H1><UL>
<LI>Congo<I>242</I>
<LI>Egypt<I>20</I>
<LI>Belize<I>501</I>
<LI>Spain<I>34</I>
</UL>

Fig. 1. Sample HTML page

This is a simple example that demonstrates a repeated pattern formed by
tag tokens in a Web page following a simple translation convention. In practice,
many searchable Web sites also exhibit such repeated patterns since they
usually extract data from a relational database and produce dynamic Web pages
with a predefined format style. Therefore, what we ought to do is a kind of reverse
engineering to discover the original format style and the content we need to
extract. Meanwhile, we also find that extraction patterns of the desired information
(called the main information block as defined in [?]) often occur regularly and
closely in a Web page. These observations motivate us to look for an approach
to discover repeated patterns and validation criteria to filter desired repeats that
are spaced regularly and closely.

HTML tags are the basic components for data presentation, and the text
strings between tags are exactly what we see in the browser. Hence, it is intuitive
to regard the text string between two tags as one unit, as well as each individual
tag. This simple version of HTML translation will be used in the rest of this paper,
where any text string between two tags is translated to one unit called Text(_)
and every HTML tag is translated to a token Html(<tag>) according to its tag
name.

Such a translation convention exposes many repeated patterns.
By repeated pattern, we mean any substring that occurs at least twice in the encoded
token string. Thus, not only does the sequence “Html(<LI>) Text(_) Html(<I>)
Text(_) Html(</I>)” conform to the definition of a repeated pattern, but so do the
subsequences “Html(<LI>) Text(_) Html(<I>)”, “Text(_) Html(<I>) Text(_)”,
“Html(<I>) Text(_) Html(</I>)”, etc. To distinguish among these repeats, we
define maximal repeats to uniquely identify the longest pattern as follows.

Definition. Given an input string $S$, we define a maximal repeat $\alpha$ as a
substring of $S$ that occurs in $k$ distinct positions $p_1, p_2, \ldots, p_k$ in $S$,
such that the $(p_i - 1)$th token in $S$ is different from the $(p_j - 1)$th token for
at least one $i, j$ pair, $1 \le i < j \le k$ (called left maximal), and the
$(p_x + |\alpha|)$th token is different from the $(p_y + |\alpha|)$th token for at least
one $x, y$ pair, $1 \le x < y \le k$ (called right maximal).
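
The definition can be checked directly on a short token string. The following is a minimal brute-force sketch in Python, not the PAT-tree method used in this paper; the function name `maximal_repeats` and the choice to treat the start and end of the string as a distinct neighbouring symbol are assumptions made for illustration.

```python
from collections import defaultdict


def maximal_repeats(tokens, min_count=2):
    """Return {pattern: [0-based start positions]} for all maximal repeats.

    Brute force over every substring, so only suitable for short inputs.
    Assumption: a string boundary counts as a neighbour that differs from
    every real token (represented by None).
    """
    n = len(tokens)
    occurrences = defaultdict(list)
    for start in range(n):
        for end in range(start + 1, n + 1):
            occurrences[tuple(tokens[start:end])].append(start)

    result = {}
    for pattern, positions in occurrences.items():
        if len(positions) < min_count:
            continue
        length = len(pattern)
        # Tokens immediately to the left and right of each occurrence.
        left = {tokens[p - 1] if p > 0 else None for p in positions}
        right = {tokens[p + length] if p + length < n else None
                 for p in positions}
        # Left maximal and right maximal: at least two different neighbours.
        if len(left) > 1 and len(right) > 1:
            result[pattern] = positions
    return result


if __name__ == "__main__":
    record = ["<LI>", "T", "<I>", "T", "</I>"]
    page = ["<H1>", "T", "</H1>", "<UL>"] + record * 4 + ["</UL>"]
    for pattern, positions in maximal_repeats(page).items():
        print(len(pattern), positions)
```

On the token string of the sample page, the five-token record pattern is reported as a maximal repeat, while its prefix “<LI> T <I>” is not, because every occurrence of the prefix is followed by the same token.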

The definition of maximal repeats makes precise the widely used
term “repeats”. It also captures all interesting repetitive
structures in a clear way and avoids generating overwhelming outputs. In the
next section, we will describe how the problem of IE can be addressed by pattern
discovery.






3 IE by Pattern Discovery 

To discover patterns from an input Web page, first an encoding scheme is used
to translate the Web page into a string of abstract representations, referred
to here as tokens. Each token is represented by a binary code of length $l$. To
enable pattern discovery, we utilize a data structure called a PAT tree [?] in
which repeated patterns in a given sequence can be efficiently identified. Using
this data structure to index an input string, all possible repeats, including their
occurrence counts and their positions in the original input string, can be easily
retrieved. Finally, the discovered maximal repeats are forwarded to the validator,
which filters out undesired patterns and produces a candidate pattern.



3.1 Translator 

Since HTML tags are the basic components for document presentation and the
tags themselves carry certain structural information, it is intuitive to examine
the token string formed by HTML tags and regard any non-tag text content
between two tags as one single token called Text(_). Tokens seen in the translated
token string include tag tokens and text tokens, denoted as Html(<tag_name>)
and Text(_), respectively. For example, Html(</a>) is a tag token, where </a>
is the tag. Text(_) is a text token, which stands for a contiguous text string located
between two HTML tags.
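
The translation step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `translate`, the `skip_tags` parameter, the regular expression, and the decision to ignore whitespace-only text are all assumptions, and a regular expression is not a full HTML parser (comments and scripts are not handled).

```python
import re

# Matches an opening or closing tag and captures "/" (if any) and the tag name.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def translate(html, skip_tags=frozenset()):
    """Translate HTML into a list of Html(<TAG>) and Text(_) tokens.

    Tags whose upper-cased name is in skip_tags are dropped, and the text
    on both sides of a dropped tag is merged into a single Text(_) token.
    """
    tokens = []
    pending_text = False
    pos = 0
    for match in TAG_RE.finditer(html):
        if html[pos:match.start()].strip():
            pending_text = True          # non-blank text before this tag
        pos = match.end()
        name = match.group(2).upper()
        if name in skip_tags:
            continue                     # keep accumulating the text run
        if pending_text:
            tokens.append("Text(_)")
            pending_text = False
        tokens.append(f"Html(<{match.group(1)}{name}>)")
    if pending_text or html[pos:].strip():
        tokens.append("Text(_)")
    return tokens


if __name__ == "__main__":
    page = ("<H1>Country Code</H1><UL>\n"
            "<LI>Congo<I>242</I>\n<LI>Egypt<I>20</I>\n"
            "<LI>Belize<I>501</I>\n<LI>Spain<I>34</I>\n</UL>")
    print(translate(page))                    # all tags kept
    print(translate(page, skip_tags={"I"}))   # text-level tag <I> skipped
```

Passing a set of text-level tag names as `skip_tags` gives a coarser, block-level view of the same page, which is how different encoding schemes can be obtained from one translator.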

Tag tokens can be classified in many ways. The user can choose a classification
depending on the desired level of information to be extracted. For example,
tags in the BODY section of a document can be divided into two distinct groups: 
block-level tags and text-level tags. The former defines the structure of a doc- 
ument, and the latter defines the characteristics, such as format and style, of 
the contents of the text. Block level tags include categories such as headings, 
text containers, lists, and other classifications, such as tables and forms. Text- 
level tags are further divided into categories including logical markups, physical 
markups, and special markups for marking up texts in a text block. 

The many different tag classifications allow different HTML translations to
be generated. With these different abstraction mechanisms, different patterns
can be produced. For example, skipping all text-level tags will result in higher
abstraction from the input Web page than when all tags are included. In addition,
different patterns can be discovered and extracted when different encoding schemes
are used.

For example, when only block-level tags are considered, the corresponding
translation of Fig. 1 is the token string

“Html(<H1>)Text(_)Html(</H1>)Html(<UL>)Html(<LI>)
Text(_)Html(<LI>)Text(_)Html(<LI>)Text(_)Html(<LI>)Text(_)Html(</UL>)”,

where each token is encoded as a binary string of “0”s and “1”s with length $l$.
For example, suppose three bits encode the tokens in the Congo code as shown
in Fig. 2. The encoded binary string for the token string of the Congo code will
be “100110 101000 010110 010110 010110 010110 001$” of 3*13 bits, where “$”
represents the end of the encoded string.
Token string:

Html(<H1>)Text(_)Html(</H1>)Html(<UL>)Html(<LI>)Text(_)
Html(<LI>)Text(_)Html(<LI>)Text(_)Html(<LI>)Text(_)Html(</UL>)

Token codes:

Html(<UL>)     000
Html(</UL>)    001
Html(<LI>)     010
(not used in the Congo string)  011
Html(<H1>)     100
Html(</H1>)    101
Text(_)        110

Encoded string:

100110101000010110010110010110010110001$

Indexing position:

suffix 1    100110101000010110010110010110010110001$
suffix 2    110101000010110010110010110010110001$
suffix 3    101000010110010110010110010110001$
suffix 4    000010110010110010110010110001$
suffix 5    010110010110010110010110001$
suffix 6    110010110010110010110001$
suffix 7    010110010110010110001$
suffix 8    110010110010110001$
suffix 9    010110010110001$
suffix 10   110010110001$
suffix 11   010110001$
suffix 12   110001$
suffix 13   001$

Fig. 2. The PAT tree for the Congo Code



3.2 The PAT Tree 

Our approach for pattern discovery uses a PAT tree to discover repeated patterns
in the encoded token string. A PAT tree is a Patricia tree (Practical Algorithm
to Retrieve Information Coded in Alphanumeric [?]) constructed over all the
possible suffix strings. A Patricia tree is a particular implementation of a
compressed binary (0,1) digital tree such that each internal node in the tree has
two branches: zero goes to the left and one goes to the right. Like a suffix tree [?],
the Patricia tree stores all its data at the external nodes and keeps one integer, the
bit-index, in each internal node as an indication of which bit is to be used for
branching. For a character string with $n$ indexing points (or $n$ suffixes), there will
be $n$ external nodes in the PAT tree and $n - 1$ internal nodes. This makes the
tree $O(n)$ in size.

When a PAT tree is to index a sequence of characters (or tokens here) not
just 0 or 1, the binary codes for the characters can be used. For simplicity,
each character is encoded as a fixed-length binary code. Specifically, given a finite
alphabet $\Sigma$ of a fixed size, each character $x \in \Sigma$ is represented by a
binary code of length $l = \lceil \log_2 |\Sigma| \rceil$. For a sequence $S$ of $n$
characters, the binary input $B$ will have $n \cdot l$ bits, but only the
$(i \cdot l + 1)$th bit has to be indexed for $i = 0, \ldots, n - 1$.
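
The fixed-length encoding and its indexing points can be sketched as follows. This is a minimal illustration in Python; the helper names `code_length`, `encode`, and `sistring` are hypothetical, and the code table is an assumption read off the encoded binary string of the Congo example rather than taken from a legible copy of the figure.

```python
def code_length(alphabet_size):
    """Bits per token, ceil(log2(alphabet_size)), computed with integers."""
    return max(1, (alphabet_size - 1).bit_length())


def encode(tokens, codes):
    """Concatenate fixed-length codes; return (bits, 1-based index points)."""
    width = len(next(iter(codes.values())))
    bits = "".join(codes[token] for token in tokens)
    # Only the first bit of each token starts a semi-infinite string.
    index_points = [i * width + 1 for i in range(len(tokens))]
    return bits, index_points


def sistring(bits, index_points, number):
    """Return the suffix of the bit string that starts at the given token."""
    return bits[index_points[number - 1] - 1:]


# Assumed code table for the Congo example (three bits per token).
CONGO_CODES = {
    "Html(<UL>)": "000", "Html(</UL>)": "001", "Html(<LI>)": "010",
    "Html(<H1>)": "100", "Html(</H1>)": "101", "Text(_)": "110",
}

if __name__ == "__main__":
    tokens = (["Html(<H1>)", "Text(_)", "Html(</H1>)", "Html(<UL>)"]
              + ["Html(<LI>)", "Text(_)"] * 4 + ["Html(</UL>)"])
    bits, points = encode(tokens, CONGO_CODES)
    print(bits)
    print(points)
    print(sistring(bits, points, 9))
```

With thirteen tokens of three bits each, the bit string has 39 bits and thirteen indexing points, one per token, at bit positions 1, 4, 7, and so on.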

Referring to Fig. 2, a PAT tree is constructed from the encoded binary string
of the Congo example. The tree is constructed from thirteen sequences of bits,
with each sequence of bits starting from each of the encoded tokens and extending
to the end of the token string. Each sequence is called a “semi-infinite string”, or
“sistring” for short. Each leaf, or external node, is represented by a square labeled
by a number that indicates the starting position of a sistring. For example, leaf
2 corresponds to sistring 2, which starts from the second token in the token string.
Each internal node is represented by a circle, which is labeled by a bit position
in the encoded bit string. The bit position is used when locating a given sistring
in the PAT tree.

Virtually, each edge in the PAT tree has an edge label. For example, the edge
label between nodes d and e is “101100”, the 8th to 13th bits of suffixes 9,
7, and 5. Edges that are visited when traversing downward from the root to a leaf
form a path that leads to the sistring corresponding to the leaf. The concatenated
edge labels along the path form a virtual path label. For example, the edge labels
“1”, “10”, and “1...” on the path that leads from the root to leaf 2 form the prefix
“1101...”, which is a unique prefix for sistring 2.

As shown in Fig. 2, all suffix strings with the same prefix are located in
the same subtree. Hence, the PAT tree allows surprisingly efficient, linear-time
solutions to complex string search problems, for example string prefix searching,
proximity searching, range searching, longest repetition searching, most frequent
searching, etc. [?]. Since every internal node in a PAT tree indicates a branch, it
implies a different bit following the common prefix between two suffixes. Hence, the
concatenation of the edge labels on the path from the root to an internal node
represents one repeated string in the input string. However, not every path label
or repeated string represents a maximal repeat. Let us call the $(p_k - 1)$th character
of the binary string the left character of suffix $p_k$. For the path label of an
internal node $v$ to be a maximal repeat, at least two leaves (suffixes) in $v$’s
subtree should have different left characters. By recording the occurrence counts
and the reference positions in the leaf nodes of a PAT tree, we can easily know how
many times a pattern is repeated. Hence, given the pattern length and occurrence
count, we can apply a postorder traversal to the PAT tree to enumerate all repeats.

The essence of a PAT tree is a binary suffix tree, which has also been applied
in several research fields for pattern discovery. For example, Kurtz and
Schleiermacher have used suffix trees in bioinformatics for finding repeated
substrings in genomes [?]. As for PAT trees, they have long been applied for
indexing in the field of information retrieval [?]. They have also been used in
Chinese keyword extraction for their simpler implementation than suffix trees and
their great power for pattern discovery. However, in the application of information
extraction, we are interested not only in repeats but in repeats that appear
regularly and in close vicinity. Discovered maximal repeats have to be further
validated or compared to find the best one that corresponds to the information to be
extracted.



3.3 Pattern Validation Criteria 

In the above section, we discussed how to find maximal repeats in a PAT tree.
However, there may be over 60 maximal repeats discovered in a Web page.
To classify these maximal repeats, we introduce two measures, regularity and
compactness, as described below. Let the suffixes of a maximal repeat $\alpha$ be
ordered by position such that $p_1 < p_2 < p_3 < \ldots < p_k$, where $p_i$ denotes
the position of each suffix in the encoded token sequence.
Regularity of a pattern is measured by computing the standard deviation of the
interval between two adjacent occurrences $(p_{i+1} - p_i)$, that is, the sequence of
spacings between adjacent occurrences $(p_2 - p_1), (p_3 - p_2), \ldots, (p_k - p_{k-1})$.
Regularity of the maximal repeat $\alpha$ is equal to the standard deviation of
the sequence divided by the mean of the sequence.

Compactness is a measure of the density of a maximal repeat. It is used to
eliminate maximal repeats that are scattered far apart beyond a given bound.
Compactness is defined as $k \cdot |\alpha| / \sum_{i=2}^{k} (p_i - p_{i-1})$, where
$|\alpha|$ is the length of $\alpha$ in number of tokens.

The value of regularity is located between 0 and 1 while the value of density 
is greater than 0. Ideally, the extraction pattern should have regularity equal to 
zero and compactness equal to one. To filter potentially good patterns, a simple 
approach will be to use a threshold for each of these measures above. Implicitly, 
good patterns have small regularity and density close to one. Therefore, only 
patterns with regularity less than the regularity threshold and density between 
the density thresholds are considered validated patterns. 
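
The two measures and the threshold test can be sketched as below. This is a minimal Python illustration under stated assumptions: the population standard deviation (`statistics.pstdev`) is used because the text does not say which variant is meant, compactness follows the formula as written above (with $k$ in the numerator), and the default density bounds in `is_valid` are illustrative values rather than settings confirmed by the text.

```python
from statistics import mean, pstdev


def intervals(positions):
    """Spacings between adjacent occurrences, ordered by position."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def regularity(positions):
    """Standard deviation of the spacings divided by their mean."""
    gaps = intervals(positions)
    return pstdev(gaps) / mean(gaps)


def compactness(positions, pattern_length):
    """k * |alpha| divided by the sum of the spacings."""
    return len(positions) * pattern_length / sum(intervals(positions))


def is_valid(positions, pattern_length,
             max_regularity=0.5, density_bounds=(0.25, 1.5)):
    """Keep a pattern whose regularity and density fall inside the thresholds."""
    low, high = density_bounds
    return (regularity(positions) < max_regularity
            and low < compactness(positions, pattern_length) < high)


if __name__ == "__main__":
    # Four back-to-back occurrences of a five-token pattern.
    print(regularity([4, 9, 14, 19]), compactness([4, 9, 14, 19], 5))
    # Same pattern with one occurrence far away from the others.
    print(regularity([0, 5, 10, 40]), is_valid([0, 5, 10, 40], 5))
```

With the formula as written, $k$ back-to-back occurrences give a compactness of $k/(k-1)$, which approaches one as $k$ grows. At least two occurrences are required, since the measures are defined on the spacings.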

4 Performance Evaluation 

We first show the number of maximal repeats validated by our system
using fourteen state-of-the-art search engines, each with ten Web pages.
There are several control parameters which can affect the number of maximal
repeats validated, including encoding scheme, minimum pattern length, occurrence
count, and threshold values for regularity and compactness. Given the
minimum length 3 and count 5, the effect of different encoding schemes is shown
in Table 1. As generally expected, a higher-level encoding scheme often
results in fewer patterns. From this table, we can also see how each control
parameter filters patterns, where the thresholds are decided by the following
experiments. The value of density can be greater than one because maximal repeats
may overlap. For example, suppose a maximal repeat $\alpha$ occurs ten
times in a row. In such a case, $\alpha$ will have regularity 0 and density 1. In
addition, $\alpha\alpha$, $\alpha\alpha\alpha$, etc. also qualify as regular maximal
repeats, only with density greater than 1.



Table 1. No. of patterns validated with different encoding schemes

Encoding       Maximal    Regularity    Compactness
Scheme         Repeat     < 0.5         > 1.5      < 0.25

All-tag        117        39            22         7.6
NoPhysical      88        41            25         6.5
NoSpecial       82        29            18         5.7
Block-level     66        32            17         3.9
Fig. 3. Number of patterns successfully validated



Fig. 3 shows the effect of various regularity and density thresholds using the
all-tag encoding scheme. Basically, a low regularity threshold and a high density
threshold reduce the number of patterns, but may also miss good patterns.
Therefore, the thresholds are chosen empirically to include as many good patterns
as possible.

Table 2 shows the performance of different encoding schemes measured in
retrieval rate, accuracy rate, and matching percentage. Retrieval rate is defined
as the ratio of the number of desired data records enumerated by a maximal
repeat to the number of desired data records contained in the input text. Likewise,
accuracy rate is defined as the ratio of the number of desired data records
enumerated by a maximal repeat to the number of occurrences of the maximal
repeat. A data record is said to be enumerated by a maximal repeat if the matching
percentage is greater than a bound determined by the user. The matching
percentage is used because the pattern may contain only a portion of the data
record.
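
The two ratios can be written out as a short Python sketch. The function names and the representation of the input (one matching percentage per occurrence of the maximal repeat, with 0.0 for an occurrence that matches no desired record) are assumptions for illustration, and the numbers in the example are made up rather than taken from the experiments.

```python
def count_enumerated(matching_percentages, bound):
    """Number of records whose matching percentage exceeds the user's bound."""
    return sum(1 for m in matching_percentages if m > bound)


def retrieval_rate(enumerated, desired_records):
    """Enumerated records over desired records contained in the input text."""
    return enumerated / desired_records


def accuracy_rate(enumerated, pattern_occurrences):
    """Enumerated records over occurrences of the maximal repeat."""
    return enumerated / pattern_occurrences


if __name__ == "__main__":
    # Hypothetical page: 4 desired records, a pattern that occurs 5 times.
    matches = [0.95, 0.92, 0.40, 0.97, 0.0]
    hits = count_enumerated(matches, bound=0.9)
    print(retrieval_rate(hits, 4), accuracy_rate(hits, len(matches)))
```

In this made-up case three records are enumerated, so the retrieval rate is 3/4 and the accuracy rate is 3/5.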

With the simple encoding scheme of using block-level tags, our approach
could discover patterns which extract 86% of the records with a matching percentage
of 78%. Nearly half of the test Web sites are correctly extracted (with matching
percentage greater than 90%). Among them, nine of the fourteen Web sites have
retrieval rate and accuracy rate both greater than 0.9. However, on examining other
discovered patterns, many are incomplete due to exceptions. In the next section,
we will further improve the performance by occurrence partition and multiple
string alignment.



Table 2. Performance of different encoding schemes

Encoding Scheme    Retrieval Rate    Accuracy Rate    Matching Percentage

All-tag            0.73              0.82             0.60
NoPhysical         0.82              0.89             0.68
NoSpecial          0.84              0.88             0.70
Block-level        0.86              0.86             0.78

5 Constructing Extraction Pattern

Generally speaking, search engines utilize a “while loop” to output their results
in some template. However, they may use “if clauses” inside the while loop
to decorate the text content. For example, the keywords that are submitted
to search engines are shown in bold face for Infoseek and MetaCrawler, thus
breaking their “while loop” patterns.

From the statistics above, we conclude that “maximal repeat” and “regularity”
are the two primary criteria by which we filter candidate patterns. However, we
also found that an extraction pattern may be neither a maximal repeat nor
regular. For example, the regularity of the pattern for Excite is greater than the
default regularity threshold 0.5 because a banner is inserted among the search
results, dividing the ten matches into two parts. Besides, the “if-effect” often
hinders us from discovering complete patterns. These are the issues we address in
the following.

5.1 Occurrence Partition 

To handle patterns with regularity greater than the specified threshold 0.5, these 
patterns are carefully segmented to see if any partition of the pattern’s occur- 
rences satisfies the requirement for regularity. By definition, the regularity of 
a pattern is computed through all occurrences of the pattern. For example, if
there are k occurrences, the k - 1 intervals (between two adjacent instances) are
the statistics we use to compute the standard deviation and the mean. However,
in examples such as Lycos, the search result is divided into three blocks. Such 
occurrences increase the regularity over all instances. Nonetheless, the regularity 
of the occurrences in each information block is still small. Therefore, the idea 
here is to segment the occurrences into partitions so that we can analyze each 
partition individually. 

We do not really need to apply a clustering algorithm here; instead,
a simple loop can accomplish the job if the occurrences are ordered by their
position beforehand. Let $C_{i,j}$ denote the set of occurrences $p_i, p_{i+1}, \ldots, p_j$, and
initialize $s = 1$, $j = 1$. For each instance $p_{j+1}$, if the regularity of $C_{s,j+1}$ is greater
than $\theta$, then output $C_{s,j}$ as a partition and assign $j + 1$ to $s$.

Once the partitions are separated, we can then compute the regularity for
each individual partition. If a partition includes more occurrences than the minimum
count and has regularity less than a threshold $\epsilon$, the pattern as well as the
occurrences in this partition are output. Note that the threshold $\epsilon$ is set to a small
value much less than 0.5 to control the number of output patterns. With
this modification, the performance is improved greatly. As shown in Table 3, the
retrieval rate is increased to 93% and accuracy rate to 90%. The only tradeoff is
the increased number of patterns from 3.9 to 8.9.
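
The partition loop and the per-partition filter can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: regularity is assumed to be the standard deviation of the intervals between adjacent occurrences divided by their mean, a set with fewer than two intervals is assumed to have regularity 0, and the names `regularity`, `partition_occurrences` and `select_partitions` are hypothetical.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Std. dev. of the gaps between adjacent occurrences over their mean.

    Assumption: with fewer than two gaps there is no spread, so return 0.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return 0.0
    return pstdev(gaps) / mean(gaps)


def partition_occurrences(positions, theta):
    """Cut the ordered occurrences wherever adding the next one would push
    the regularity of the running partition above theta."""
    positions = sorted(positions)
    partitions, start = [], 0
    for j in range(1, len(positions)):
        if regularity(positions[start:j + 1]) > theta:
            partitions.append(positions[start:j])
            start = j  # the current occurrence opens the next partition
    partitions.append(positions[start:])
    return partitions


def select_partitions(positions, theta, epsilon, min_count):
    """Keep partitions with more than min_count occurrences and
    regularity below epsilon."""
    return [p for p in partition_occurrences(positions, theta)
            if len(p) > min_count and regularity(p) < epsilon]


# Two evenly spaced blocks separated by a large gap (e.g. a banner).
blocks = partition_occurrences([0, 10, 20, 30, 200, 210, 220, 230], theta=0.5)
```

Taken as a whole, the eight occurrences have a large regularity value because of the single long gap, yet each of the two resulting blocks is perfectly regular, which is the situation described for Excite and Lycos.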
Table 3. Performance of advanced techniques

Advanced Technique     Retrieval Rate   Accuracy Rate   Matching Percentage
Occurrence Partition   0.93             0.93            0.84
Multiple Alignment     0.97             0.94            0.90



5.2 Multiple String Alignment 

To handle the harder problem of incompletely discovered patterns, the technique of
multiple string alignment is borrowed to find a good representation of the critical
common features of multiple strings. For example, suppose “adc” is the
discovered pattern for the token string “adcwbdadcxbadcxbdadc”. If we have the following
multiple alignment for the strings “adcwbd”, “adcxb” and “adcxbd”:

a d c w b d
a d c x b -
a d c x b d

the extraction pattern can be generalized as “adc[w|x]b[d|-]” to cover these
three instances. Specifically, suppose a validated maximal repeat has $k + 1$
occurrences $p_1, p_2, \ldots, p_{k+1}$ in the encoded token string. Let string $P_i$ denote the
string starting at $p_i$ and ending at $p_{i+1} - 1$. The problem is to find the multiple
alignment of the $k$ strings $S = \{P_1, P_2, \ldots, P_k\}$ so that the generalized pattern
can be used to extract all the records we need.
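
Reading a generalized pattern off an alignment amounts to collecting the distinct symbols in each column. The sketch below illustrates this for single-character tokens; the function name `generalize` and the bracket notation for alternatives are illustrative choices that mirror the example above.

```python
def generalize(aligned_rows):
    """Build a pattern string from equally long aligned rows.

    A column with one distinct symbol is emitted as is; a column with
    several symbols (possibly including the space '-') becomes an
    alternative group such as [w|x].
    """
    pattern = []
    for column in zip(*aligned_rows):
        symbols = list(dict.fromkeys(column))  # distinct, first-seen order
        if len(symbols) == 1:
            pattern.append(symbols[0])
        else:
            pattern.append("[" + "|".join(symbols) + "]")
    return "".join(pattern)


pattern = generalize(["adcwbd", "adcxb-", "adcxbd"])  # "adc[w|x]b[d|-]"
```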

Multiple string comparison is a natural generalization of the alignment of two
strings, which can be solved in $O(nm)$ time by dynamic programming to obtain the
optimal edit distance, where $n$ and $m$ are the string lengths. As an example of
two-string alignment, consider the alignment of the two strings acwbd and adcxb shown
below:

a - c w b d
a d c x b -

In this alignment, character w is mismatched with x, the two d's are opposite
hyphens (also called spaces), and all other characters match their counterparts in the
opposite string. If we give each match a value of $\beta$, each mismatch a value of
$\gamma$, and each space a value of $\delta$, the two-string alignment problem is to optimize
the weighted distance $D(P_1, P_2) = n_{match} \cdot \beta + n_{mis} \cdot \gamma + n_{space} \cdot \delta$, where
$n_{match}$, $n_{mis}$, and $n_{space}$ denote the number of matches, mismatches, and spaces,
respectively ($n_{match} = 3$, $n_{mis} = 1$, and $n_{space} = 2$ here).
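
The counts in this example can be checked mechanically. The following sketch scores a given two-row alignment; the weights passed in the usage line (0 for a match, 1 for a mismatch or a space) are an illustrative choice, not values from the paper.

```python
def alignment_counts(row1, row2, space="-"):
    """Count (matches, mismatches, spaces) in a two-row alignment."""
    n_match = n_mis = n_space = 0
    for a, b in zip(row1, row2):
        if a == space or b == space:
            n_space += 1
        elif a == b:
            n_match += 1
        else:
            n_mis += 1
    return n_match, n_mis, n_space


def weighted_distance(row1, row2, beta, gamma, delta):
    """D = n_match * beta + n_mis * gamma + n_space * delta."""
    n_match, n_mis, n_space = alignment_counts(row1, row2)
    return n_match * beta + n_mis * gamma + n_space * delta


counts = alignment_counts("a-cwbd", "adcxb-")             # (3, 1, 2)
distance = weighted_distance("a-cwbd", "adcxb-", 0, 1, 1)  # 3
```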

Extending dynamic programming to multiple string alignment yields an $O(n^k)$
algorithm. Instead, an approximation algorithm is available such that the score
of the multiple alignment is no greater than twice the score of the optimal
multiple alignment [5]. The approximation algorithm starts by computing the center
string $S_c$ among the $k$ strings in $S$ that minimizes the consensus error
$\sum_{P_i \in S} D(S_c, P_i)$. Once
the center string is found, each string is then iteratively aligned to the center
string to construct the multiple alignment, which is in turn used to construct the
extraction pattern.
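
The first step of the approximation, choosing the center string, can be sketched as below. This is an illustration only: it uses unit costs (match 0, mismatch 1, space 1) as an assumed scoring, and it stops after selecting the center; the subsequent merging of pairwise alignments into one multiple alignment is not shown.

```python
def edit_distance(s, t, match=0, mismatch=1, space=1):
    """Optimal two-string weighted distance by dynamic programming, O(n*m)."""
    prev = [j * space for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * space]
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(min(prev[j - 1] + pair,   # align s[i-1] with t[j-1]
                           prev[j] + space,      # s[i-1] opposite a space
                           cur[j - 1] + space))  # t[j-1] opposite a space
        prev = cur
    return prev[-1]


def center_string(strings):
    """String minimizing the sum of distances to all strings in the set."""
    return min(strings,
               key=lambda c: sum(edit_distance(c, p) for p in strings))


center = center_string(["adcwbd", "adcxb", "adcxbd"])  # "adcxbd"
```

For the three strings of the running example, “adcxbd” is at distance 1 from each of the other two, so it has the smallest total and is chosen as the center.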
For each pattern with density less than one, the center star approximation
algorithm for multiple string alignment is applied to generalize the extraction
pattern. Suppose the generalized extraction pattern is expressed as “$c_1 c_2 c_3 \ldots c_n$”,
where each $c_i$ is either a symbol or a subset of $\Sigma \cup \{-\}$ containing the symbols that
can appear at position $i$. An additional step is taken to generate patterns of
the form “$c_j c_{j+1} c_{j+2} \ldots c_n c_1 c_2 \ldots c_{j-1}$” for each position $j$ holding a single symbol that is one of
the following special tags: <DL>, <DT>, <TR>, <P>, <BR> or <HR>,
because extraction patterns often begin or end with them (see footnote 1).
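
The rotation step can be sketched as follows. The pattern is represented here as a list whose entries are either a single token (a string) or a set of alternatives; this representation, the name `rotations`, and the sample pattern are assumptions made for illustration. Only the six tags named in the text are listed, and the tag set would be extended with the further tags given in the footnote.

```python
SPECIAL_TAGS = {"<DL>", "<DT>", "<TR>", "<P>", "<BR>", "<HR>"}


def rotations(pattern, special_tags=SPECIAL_TAGS):
    """Return every rotation c_j ... c_n c_1 ... c_{j-1} whose first
    position j holds a single symbol that is a special tag."""
    result = []
    for j, c in enumerate(pattern):
        if isinstance(c, str) and c in special_tags:
            result.append(pattern[j:] + pattern[:j])
    return result


# Hypothetical generalized pattern; the set marks an optional <I> tag.
example = ["<B>", "TEXT", "<BR>", {"<I>", "-"}, "<P>"]
candidates = rotations(example)
```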

We adopt this additional step because the generated extraction pattern may
not start at the beginning of a record. The experimental results show that with the
help of multiple string alignment and the additional step, the performance is
improved to 97% retrieval rate, 94% accuracy rate and 0.90 matching percentage.
The high retrieval rate is encouraging. The ninety percent
matching percentage is actually higher in terms of the text content retrieved. For
those Web sites with matching percentage greater than 85%, the text contents are
all successfully extracted. The remaining concern is the accuracy rate, since the extraction
pattern generalized from multiple alignment may cover more than the
information we need. For example, the generalized rule for Lycos will extract
information in all three blocks while only the information in one block is
desired, causing a lower accuracy rate.



6 Summary and Future Work 

Information extraction from Web pages is a core technology for
comparison-shopping agents [2], which Doorenbos et al. regard as an improvement along the axis
of tolerating unstructured information. The characteristics of regularity,
uniformity, and vertical separation make learning possible. In this paper,
we have presented an unsupervised approach to semi-structured information
extraction. We propose the application of PAT trees for pattern discovery in the
encoded token string of Web pages. Once the PAT tree is constructed, we can
easily traverse the tree to find all maximal repeats given the expected pattern
frequency and length. The discovered maximal repeats are further filtered by
measures such as regularity and compactness. The filtering criteria aim to keep
the number of patterns as small as possible while at the same time retaining all
interesting patterns. Furthermore, occurrence partition is applied to handle
patterns with regularity greater than the default threshold. Finally, multiple string
alignment is applied to patterns with density less than one to generalize the
extraction pattern. Thereby, the extraction module can simply apply a pattern matching
algorithm to extract all records.

The extraction rule generalized from multiple string alignment has achieved
97% retrieval rate and 91% accuracy rate. The whole process requires no human
intervention and no training examples. Compared with other algorithms, our
approach is quick and expressive. It takes only three minutes to extract 140 Web
pages. The extraction rule, allowing alternative tokens and missing tokens, can
tolerate exceptions and variance in the input.

Footnote 1. Other tags include <TABLE>, <TD>, <UL>, <OL>, <LI>, <DD>.

We are currently applying this approach to more test data formatted in
tabular form, on which it performs at the level of 80% retrieval rate. As more variance
occurs in the input pages, it becomes even more difficult to obtain a good multiple string
alignment. In such cases, the scoring of edit distance between two strings and the
algorithm used to construct the multiple alignment become more important. In addition,
filtering of the constructed patterns can also provide a reasonable number of
patterns for the user to choose from.

Acknowledgements. This work is sponsored by National Science Council, 
Taiwan under grant NSC89-2213-E-008-056. Also, we would like to thank
Lee-Feng Chien, Ming-Jer Lee and Jung-Liang Chen for providing their PAT tree
code to us.

References 

1. Chien, L.F. 1997. PAT-tree-based keyword extraction for Chinese information
retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 50-58.

2. Doorenbos, R.B.; Etzioni, O.; and Weld, D.S. 1997. A scalable comparison-shopping
agent for the World-Wide Web. In Proceedings of the First International Conference
on Autonomous Agents, pp. 39-48, New York, NY, ACM Press.

3. Embley, D.; Jiang, Y.; and Ng, Y.-K. 1999. Record-boundary discovery in Web
documents. In Proceedings of the 1999 ACM SIGMOD International Conference
on Management of Data (SIGMOD'99), pp. 467-478, Philadelphia, Pennsylvania.

4. Gonnet, G.H.; Baeza-Yates, R.A.; and Snider, T. 1992. New Indices for Text: Pat
Trees and Pat Arrays. Information Retrieval: Data Structures and Algorithms,
Prentice Hall.

5. Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences, Cambridge.

6. Hsu, C.-N. and Dung, M.-T. 1998. Generating finite-state transducers for
semi-structured data extraction from the Web. Information Systems 23(8):521-538.

7. Knoblock, A. et al., eds. 1998. Proc. 1998 Workshop on AI and Information
Integration, Menlo Park, California: AAAI Press.

8. Kurtz, S. and Schleiermacher, C. 1999. REPuter: fast computation of maximal
repeats in complete genomes. Bioinformatics 15(5):426-427.

9. Kushmerick, N.; Weld, D.; and Doorenbos, R. 1997. Wrapper induction for
information extraction. In Proceedings of the 15th International Joint Conference on
Artificial Intelligence (IJCAI).

10. Muslea, I.; Minton, S.; and Knoblock, C. 1999. A hierarchical approach to
wrapper induction. In Proceedings of the 3rd International Conference on Autonomous
Agents (Agents'99), Seattle, WA.

11. Muslea, I. 1999. Extraction patterns for information extraction tasks: a survey. In
Proceedings of AAAI'99: Workshop on Machine Learning for Information
Extraction.




Empirical Study of Recommender Systems 
Using Linear Classifiers 



Vijay S. Iyengar and Tong Zhang 

IBM Research Division, T. J. Watson Research Center, 
P.O. Box 218, Yorktown Heights, NY 10598, U.S.A. 



Abstract. Recommender systems use historical data on user prefer- 
ences and other available data on users (e.g., demographics) and items 
(e.g., taxonomy) to predict items a new user might like. Applications 
of these methods include recommending items for purchase and per- 
sonalizing the browsing experience on a web-site. Collaborative filter- 
ing methods have focused on using just the history of user preferences 
to make the recommendations. These methods have been categorized 
as memory-based if they operate over the entire data to make predic- 
tions and as model-based if they use the data to build a model which 
is then used for predictions. In this paper, we propose the use of lin- 
ear classifiers in a model-based recommender system. We compare our 
method with another model-based method using decision trees and with 
memory-based methods using data from various domains. Our experi- 
mental results indicate that these linear models are well suited for this 
application. They outperform the commonly proposed approach using a 
memory-based method in accuracy and also have a better tradeoff be- 
tween off-line and on-line computational requirements. 



1 Introduction 

Recommender systems use historical data on user preferences and purchases 
and other available data on users and items to recommend items that might 
be interesting to a new user. One of the earliest techniques developed for
recommendations was based on nearest-neighbor collaborative filtering algorithms
P2| that used just the history of user preferences as input. Sometimes in the 
literature the term collaborative filtering is used to refer to just these methods.
However, we will follow the taxonomy introduced by in which collaborative 
filtering (CF) refers to a broader set of methods that use prior preferences to
predict new ones. In this taxonomy, nearest-neighbor collaborative filtering al- 
gorithms are categorized as being memory-based CF. Nearest-neighbor methods 
use some notion of similarity between the user for whom predictions are being 
generated and users in the database. Variations on this notion of similarity and 
other aspects of memory-based algorithms are discussed in P|. Scalability is an 
issue with nearest-neighbor methods. Proposed methods of addressing this is- 
sue range from the use of data structures like R-trees to the use of dimension 
reduction techniques like latent semantic indexing 
In contrast, model-based CF methods use the historical data to build
models which are then used for predicting new preferences. A model-based approach
using Bayesian networks was found to be comparable to the memory-based ap- 
proach in PI- More recently, models based on a newer graphical representation 
called dependency networks have been applied to this problem 0. For this 
task, dependency network models seem to have slightly poorer accuracy but re- 
quire significantly less computation when compared to Bayesian network models 
p]. Another model based method is to use clustering to group users based on 
their past preferences. The parameters for this clustering model can be estimated 
by methods like Gibbs sampling and EM PEE). The clustering model explored 
in PI was outperformed by the model-based approach using Bayesian networks 
and by the memory-based approach CR+ described in [3].

In this paper, we explore the use of various linear classifiers in a model-based 
approach to the recommendation task. Linear classifiers have been quite suc- 
cessful in the text classification domain |^. Some of the characteristics shared 
between the text and CF domains include the high dimensionality and sparse- 
ness of the data in these domains. The main computational cost of using linear 
classifiers is in the model build phase, which is an off-line activity. The appli- 
cation of the models is very straightforward especially with sparse data. Our 
empirical study will use two data sets that reflect users’ browsing behavior and 
one data set that captures their purchases. Because of its wider applicability, we 
focus on data that is implicitly gathered, e.g., boolean flag for each web page 
representing whether or not it was browsed as in the anonymous-msweb dataset 
in m- This is in contrast with explicitly collected data, e.g., ratings explicitly 
gotten for movies HH. Section 3 presents results achieved on these datasets by 
various model-based approaches using linear classifiers. For comparison we also 
include results achieved by our implementation of the memory-based algorithm 
CR-I- described in |5] and a model-based approach using decision trees. The 
linear models studied in this paper are described in the next section. 



2 Model-Based Approaches 

The problem of predicting whether a user (or a customer) will accept a specific 
recommendation can be modeled as a binary classification problem. However, in 
a recommender system, we are also interested in the likelihood that a customer 
will accept a recommendation. This information can be used to rank all of the 
potential choices according to their likelihoods, so that we can select the top 
choices to present to the customer. It is thus necessary that the classifier we use 
returns a score (or a confidence level), where a higher score corresponds to a 
higher possibility that the customer will accept the recommendation. 



2.1 Linear Models 

Formally, a two-class categorization problem is to determine a label $y \in \{-1, 1\}$
associated with a vector $x$ of input variables. A useful method for solving this
problem is by linear discriminant functions, which consist of linear combinations
of the input variables. Various techniques have been proposed for determining
the weight values for linear discriminant classifiers from a training set of labeled
data $(x_1, y_1), \ldots, (x_n, y_n)$. Specifically, we seek a weight vector $w$ and a threshold
$\theta$ such that $w^T x < \theta$ if its label $y = -1$ and $w^T x \ge \theta$ if its label $y = 1$. A score
of value $w^T x - \theta$ can be assigned to each data point to indicate the likelihood
of $x$ being in the class.

The problem just described may readily be converted into one in which the
threshold $\theta$ is taken to be zero. One does this by converting a data point $x$
in the original space into $\tilde{x} = [x, 1]$ in the enlarged space. Each hyperplane $w$
in the original space with threshold $\theta$ can then be converted into $[w, -\theta]$, which
passes through the origin in the enlarged space. Instead of searching for both
a $d$-dimensional weight vector and a threshold $\theta$, we can search for a
$(d + 1)$-dimensional weight vector with an anticipated threshold of zero.
In the following, unless otherwise indicated, we assume that the vectors of input
variables have been suitably transformed so that we may take $\theta = 0$. We also
assume that $x$ and $w$ are $d$-dimensional vectors.
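
A small numerical check of this transformation is given below. The vectors and the threshold are arbitrary illustrative values, and the helper names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def augment_point(x):
    """Map x to [x, 1] in the enlarged space."""
    return list(x) + [1.0]


def augment_weights(w, theta):
    """Map (w, theta) to [w, -theta], a hyperplane through the origin."""
    return list(w) + [-theta]


w, theta, x = [2.0, -1.0], 0.5, [1.0, 3.0]
original_score = dot(w, x) - theta
augmented_score = dot(augment_weights(w, theta), augment_point(x))
# Both scores equal -1.5, so the enlarged problem uses threshold zero.
```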

Many algorithms have been proposed for linear classification. We start our
discussion with the least squares algorithm, which is based on the following
formulation to compute a linear separator $\hat{w}$:

$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i y_i - 1)^2. \tag{1}$$

The least squares method is extensively used in engineering and statistics. Al- 
though the method has mainly been associated with regression problems, it can 
also be used in classification. Examples include use in text categorization m 
and uses in combination with neural networks m- 
The solution of (1) is given by

$$\hat{w} = \Big(\sum_{i=1}^{n} x_i x_i^T\Big)^{-1} \Big(\sum_{i=1}^{n} x_i y_i\Big).$$

One problem with the above formulation is that the matrix $\sum_{i=1}^{n} x_i x_i^T$ may be
singular or ill-conditioned. This occurs, for example, when $n$ is less than the
dimension of $x$. Note that in this case, for any $\hat{w}$, there exist infinitely many
solutions $\tilde{w}$ of $\tilde{w}^T x_i = \hat{w}^T x_i$ for $i = 1, \ldots, n$. This implies that (1) has infinitely
many possible solutions $\hat{w}$.

A remedy of this problem is to use a pseudo-inverse H2|. However, one prob- 
lem of the pseudo-inverse approach is its computational complexity. In order 
to handle large sparse systems, we need to use iterative algorithms which do 
not rely on matrix factorization techniques. Therefore in this paper, we use the
standard ridge regression method that adds a regularization term to (1):

$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i y_i - 1)^2 + \lambda w^2, \tag{2}$$
where $\lambda$ is an appropriately chosen regularization parameter. The solution is
given by

$$\hat{w} = \Big(\sum_{i=1}^{n} x_i x_i^T + \lambda n I\Big)^{-1} \Big(\sum_{i=1}^{n} x_i y_i\Big),$$

where $I$ denotes the identity matrix. Note that $\sum_{i=1}^{n} x_i x_i^T + \lambda n I$ is always
non-singular, which solves the ill-conditioning problem. The regularized least
squares formulation (2) can be solved by using a column relaxation method
which is often called the Gauss-Seidel procedure in numerical optimization. The
algorithm (see Algorithm 1 in Appendix A) cycles through the components of $w$,
and optimizes one component at a time (while keeping the others fixed).
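
A component-wise update of this kind can be sketched as follows. This is not the Algorithm 1 of Appendix A, which is not reproduced here; it is a minimal sketch derived directly from the regularized least squares objective (2), using dense Python lists for clarity where a practical implementation would exploit sparsity. The function name and the default of 25 sweeps are illustrative.

```python
def ridge_gauss_seidel(X, y, lam, iterations=25):
    """Minimize (1/n) * sum_i (w.x_i * y_i - 1)^2 + lam * w.w by cycling
    through the components of w and solving exactly for each in turn.
    Labels y_i are assumed to be +1 or -1 and lam to be positive."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    pred = [0.0] * n  # pred[i] caches w . x_i
    for _ in range(iterations):
        for j in range(d):
            # Since y_i^2 = 1, (w.x_i * y_i - 1) * y_i = w.x_i - y_i.
            grad = sum((pred[i] - y[i]) * X[i][j] for i in range(n)) \
                + lam * n * w[j]
            curv = sum(X[i][j] ** 2 for i in range(n)) + lam * n
            delta = -grad / curv  # exact minimizer along coordinate j
            w[j] += delta
            for i in range(n):
                pred[i] += delta * X[i][j]
    return w


# Tiny example with orthogonal inputs; the closed form gives [0.5, -0.5].
w_hat = ridge_gauss_seidel([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0], lam=0.5)
```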

Another popular method is the support vector machine, which is a method 
originally proposed by Vapnik that has nice properties from the sample 

complexity theory. Slightly different from our approach of forcing the threshold $\theta = 0$,
and then compensating by appending 1 to each data vector, the standard linear
support vector machine (cf. m) explicitly includes $\theta$ in a quadratic formulation
that can be transformed to:

$$(\hat{w}, \hat{\theta}) = \arg\inf_{w, \theta} \frac{1}{n} \sum_{i=1}^{n} g(y_i (w^T x_i - \theta)) + \lambda w^2, \tag{3}$$

where

$$g(z) = \begin{cases} 1 - z & \text{if } z \le 1, \\ 0 & \text{if } z > 1. \end{cases} \tag{4}$$



It is interesting to compare the least squares approach and the support vector
machine approach. In the least squares formulation, the loss function $(z - 1)^2$
implies that we try to find a weight $w$ such that $w^T x \approx 1$ for an in-class data
point $x$, and $w^T x \approx -1$ for an out-of-class data point $x$. Although this means
that the formulation attempts to separate the in-class data from the
out-of-class data, it also penalizes a well-behaved data point $x$ such that $w^T x y > 1$.
The support vector machine approach remedies this problem by choosing a loss
function that does not penalize a well-behaved data point such that $w^T x y > 1$.

A popular method to obtain the numerical solution of an SVM is the SMO 
algorithm Hg. However, in general solving an SVM is rather complicated since g 
is not smooth. In this paper, we intentionally replace $g(z)$ by a smoother function
to make the numerical solution simple. The following formulation modifies the
least squares algorithm so that it does not penalize a data point with $w^T x y > 1$,
and it has a loss function that is smoother than that of an SVM:

$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} h(w^T x_i y_i) + \lambda w^2, \tag{5}$$

where

$$h(z) = \begin{cases} (1 - z)^2 & \text{if } z \le 1, \\ 0 & \text{if } z > 1. \end{cases} \tag{6}$$
This formulation is a mixture of the least squares method and a standard
SVM. We thus call it modified least squares. Furthermore, a direct numerical
optimization of (5) can be performed relatively efficiently. Similar to (2),
Algorithm 2 in Appendix A solves (5).

Another way to solve (5) is given in Algorithm 3 in Appendix A. It is derived
by using convex duality. Because of space limitations, we skip the analysis.
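
The three loss functions discussed in this section can be compared side by side as functions of the margin $z = w^T x y$. The sketch below only restates the definitions above in code; the function names are illustrative.

```python
def least_squares_loss(z):
    """(z - 1)^2: also penalizes well-behaved points with z > 1."""
    return (z - 1) ** 2


def svm_loss(z):
    """g(z): 1 - z for z <= 1 and 0 otherwise; not smooth at z = 1."""
    return 1 - z if z <= 1 else 0


def modified_least_squares_loss(z):
    """h(z): (1 - z)^2 for z <= 1 and 0 otherwise; smooth at z = 1."""
    return (1 - z) ** 2 if z <= 1 else 0


# Rows are (margin, least squares, SVM, modified least squares).
comparison = [(z, least_squares_loss(z), svm_loss(z),
               modified_least_squares_loss(z)) for z in (-1, 0, 1, 2)]
```

At a margin of 2 the plain least squares loss is 1 while the other two are 0, which is the penalty on well-behaved points that the modification removes.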

2.2 Other Models 

In the recommender system application, interpretability of the models used is an 
important characteristic to be considered in addition to the accuracy achieved 
and the computational requirements. We have included a decision tree based 
recommender system in this empirical study as an example of using an inter- 
pretable model. In this decision tree package, the splitting criterion during tree
growth is a modified version of entropy and the tree pruning is done using a
Bayesian model combination approach originated from data compression CHI 
CH]. A similar approach has appeared in m- 

We have also implemented a version of the nearest neighbor algorithm, CR+,
described in Cl and included it in our study. As suggested in 0, inverse user
frequency, case amplification and default voting heuristics are used in our
implementation of CR+.

3 Experiments 

The true value of a recommender system can only be measured by controlled 
experiments with actual users. Such an experiment could measure the relative 
lift achieved by a specific recommendation algorithm when compared to, say, 
recommending the most popular item. Experiments with historical data have 
been used to estimate the value of recommender algorithms in the absence of 
controlled live experiments |8l4j . In this paper we will follow experimental pro- 
cedures similar to those introduced in 0. 



3.1 Data Sets 

Characteristics of the data sets used in our experiments are given in Table 1.
The first dataset msweb was introduced in jSj and added to the UCI repository 
under the name anonymous-msweb. As described in 0, this dataset contains 
for each user the web page groups (called vroots) that were visited in a fixed 
time period. The total number of items is relatively small (around 300) for this 
dataset and this can be attributed to the fact that an item refers to a group of 
web pages. 

The second dataset pageweb also captures visits by users to a different web
site but at the individual page level (with about 6000 total items). Intuitively,
one might expect the task of recommending specific pages to be more difficult
than that of recommending page groups. But the other factor to be considered
is that we also have more fine-grained information at the individual page level
about user preferences that can be used by the models. This dataset will be
useful in evaluating how the various algorithms handle recommending from a
large number of items.

Table 1. Description of the data sets.

Characteristics                                  msweb   pageweb   wine
Training cases                                   32711   9195      13103
Total test cases                                 5000    1804      2610
Test cases with at least 2 items (All But 1)     3453    1243      1770
Test cases with at least 3 items (Given 2)       2213    932       1280
Test cases with at least 6 items (Given 5)       657     455       624
Test cases with at least 11 items (Given 10)     102     168       268
Total items                                      294     5781      663
Mean items per case in training set              3.02    4.36      4.60

The third dataset wine represents wine purchases made by customers of a 
leading supermarket chain store within a specified period. The dataset captures 
for each customer the wines purchased in this period as a binary value (purchased 
versus not purchased). 

We have chosen to use a binary representation of the item/page variables in 
all the experiments. An alternative representation would use more information
like the number of visits to a web page or the time spent viewing a web page or 
the quantity of wine purchased. 



3.2 Experimental Setup 

Following the experimental setup introduced in |E], the datasets are split into two 
disjoint sets (training and test) of users. The entire set of visits (or purchases) 
for users in the training set is available for the model build process. The known 
visits (purchases) for users in the test set are split into two disjoint sets: given 
and hidden. The given set is used by the recommender methods to rank all the 
remaining items in the order of predicted preference to the user. This ranked 
list is evaluated by using the hidden set as the reference indicating what should 
have been predicted. 

The evaluation metric, R, proposed in Pj is based on the assumption that 
each successive item in a list is less likely to be interesting to the user with an 
exponential decay. This metric uses a parameter, $\alpha$, which is the position in the
ranked list which has a 50-50 chance of being considered by the user. As in jS]
we will set $\alpha$ such that the fifth position has a 50-50 chance of being considered.
The exponential decay in interest, which forms the basis for the R metric,
may be a plausible behavior model for consumers. However, this metric is not
easy to interpret. Also, the number of allowed recommendations can be
constrained by the environment. For example, the form factor of a hand-held 
device might restrict the number of recommendations on it to a small number. 
Hence, we will also report results using a simpler metric which measures the 
fraction of users for whom at least one valid recommendation (according to the 
hidden set as reference) was given in the top K items of the ranked list. In 
particular, we will report this metric for K values 1, 3 and 10. 
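
Both metrics can be sketched as follows. The top-K success rate follows the description above directly. The R metric formula is an assumption: the text only describes it as an exponential decay with half-life position $\alpha$, and the sketch uses the commonly cited form in which the item at rank $j$ contributes $2^{-(j-1)/(\alpha-1)}$ and the total is normalized by the best achievable value (all hidden items ranked first). Function names and sample data are illustrative.

```python
def r_metric(ranked_lists, hidden_sets, alpha=5):
    """Exponential-decay utility as a percentage of the maximum achievable.

    Assumed form: a hit at 1-based rank j contributes
    2 ** (-(j - 1) / (alpha - 1)), so rank alpha contributes 0.5.
    """
    achieved = best = 0.0
    for ranked, hidden in zip(ranked_lists, hidden_sets):
        achieved += sum(2 ** (-pos / (alpha - 1))
                        for pos, item in enumerate(ranked) if item in hidden)
        best += sum(2 ** (-pos / (alpha - 1)) for pos in range(len(hidden)))
    return 100.0 * achieved / best


def top_k_success(ranked_lists, hidden_sets, k):
    """Percentage of users with at least one hidden item in the top k."""
    hits = sum(1 for ranked, hidden in zip(ranked_lists, hidden_sets)
               if any(item in hidden for item in ranked[:k]))
    return 100.0 * hits / len(ranked_lists)


ranked = [["a", "b", "c"], ["x", "y", "z"]]
hidden = [{"a"}, {"z"}]
success_at_1 = top_k_success(ranked, hidden, 1)   # 50.0
success_at_3 = top_k_success(ranked, hidden, 3)   # 100.0
```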

The split of the test set data into the given and hidden sets is done as 
suggested in ^]. Three of these splits are denoted as Given 2, Given 5, and 
Given 10. These have 2, 5, and 10 items chosen into the given sets, respectively. 
The fourth split is denoted as All But 1 because one item for each test user is 
randomly chosen to be hidden in this scenario. These scenarios can be used to 
assess how each recommender system handles different amounts of information 
being known and to be hidden (predicted) for each test user. Table 1 provides
for each dataset the number of test users that are included in each of these 
scenarios. 
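
The given/hidden split for one test user can be sketched as below. The function name, the use of `None` to denote the All But 1 scenario, and the seeded generator are illustrative assumptions; the eligibility rule (a user needs at least one more item than the size of the given set) matches the counts listed in Table 1.

```python
import random


def split_given_hidden(items, n_given=None, rng=None):
    """Split one test user's items into (given, hidden) sets.

    n_given=2, 5 or 10 corresponds to Given 2, Given 5 and Given 10;
    n_given=None corresponds to All But 1 (one randomly chosen hidden item).
    Returns None if the user has too few items for the scenario.
    """
    rng = rng or random.Random()
    pool = sorted(items)          # fixed order before shuffling
    rng.shuffle(pool)
    if n_given is None:           # All But 1
        if len(pool) < 2:
            return None
        return set(pool[1:]), {pool[0]}
    if len(pool) < n_given + 1:   # need at least one hidden item
        return None
    return set(pool[:n_given]), set(pool[n_given:])


visits = {"p1", "p2", "p3", "p4"}
given, hidden = split_given_hidden(visits, n_given=2, rng=random.Random(0))
```

Repeating the split with different seeds corresponds to the repeated runs with different random choices of given and hidden subsets.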

Each scenario will be run five times with different random choices for the split 
between given and hidden subsets in the test data. Mean values and standard 
deviations are computed over these five experiments. We have adopted this ap- 
proach to be compatible with the prior literature with regard to the training/test 
splits. A more traditional approach would have been to use n-fold cross valida- 
tion where both training and test sets are different in the experiments. However, 
given the compatibility constraints, performing multiple experiments with the 
given/hidden splits provides some information on the experimental variability. 

3.3 Results 

The results achieved for all four scenarios (Given 2, Given 5, Given 10, All
But 1) are given in Tables 3, 4 and 5 for the datasets msweb, pageweb and wine,
respectively. The format in which these results are provided in these tables for
each combination of algorithm and scenario is explained in Table 2. The mean
and the standard deviation for the R metric (expressed as percentage) are given
on top. The three numbers below indicate the percentage of test users that
had at least one successful recommendation in the top 1, 3 and 10 positions of
the ranked list. The linear models were generated using 25 iterations with the
regularization parameter $\lambda$ set at 0.001.

The baseline approach of recommending popular items performs significantly
worse than the other algorithms on datasets with more items.
The decision tree model also exhibits this pattern of not performing as well on
datasets with more items.

As mentioned earlier, one advantage of model-based methods is that the 
model build is done off-line. The model build times for the dataset msweb were 
around 500 seconds for linear least squares and modified linear (primal) mod- 
els and around 200 seconds for modified linear (dual) model. These times were 
Table 2. Explanation of the entries in Tables 3, 4 and 5.

               R metric ± std. dev.
Success          Success          Success
within top 1     within top 3     within top 10



Table 3. Results on dataset msweb. For explanation on entries refer to Table 2. λ is
set at 0.001 for the linear methods.

Algorithm        Given 2           Given 5           Given 10          All But 1
Popular          46.5 ± 0.2        43.7 ± 0.5        41.6 ± 2.0        46.5 ± 0.6
                 33.9 55.8 82.0    29.0 55.6 80.6    32.2 57.1 79.8    22.7 38.3 63.8
CR+              56.7 ± 0.1        54.2 ± 0.6        51.5 ± 1.9        60.8 ± 0.6
                 45.0 70.5 88.7    39.9 68.3 88.1    43.7 67.7 87.8    34.6 54.8 76.2
Decision Tree    53.4 ± 0.3        54.3 ± 0.7        53.0 ± 1.0        62.3 ± 0.5
                 46.6 71.3 87.4    48.0 73.9 88.5    51.6 72.9 87.8    38.4 58.4 74.9
Least Squares    55.7 ± 0.3        57.5 ± 0.7        57.0 ± 1.5        64.1 ± 0.5
                 46.9 72.4 89.6    49.9 75.0 90.9    55.5 77.1 91.0    38.5 58.8 79.2
Mod LS Primal    55.6 ± 0.3        57.7 ± 0.8        56.9 ± 1.4        64.4 ± 0.5
                 46.9 72.6 89.8    50.3 75.2 91.0    56.1 76.9 91.8    38.9 59.1 79.6
Mod LS Dual      55.2 ± 0.2        57.5 ± 0.9        56.7 ± 1.4        64.4 ± 0.6
                 46.5 72.9 89.7    50.5 75.6 90.5    57.1 77.3 91.8    39.0 59.0 79.4






Fig. 1. Linear least squares classifier accuracy vs. regularization parameter λ (panels: msweb, pageweb, wine).

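Table 4. Results on dataset pageweb. For an explanation of the entries refer to Table 2. λ is
set at 0.001 for the linear methods.

algorithm      | Given2          | Given5          | Given10         | AllBut1
---------------+-----------------+-----------------+-----------------+----------------
Popular        | 8.3 ± 0.3       | 7.0 ± 0.3       | 6.2 ± 0.7       | 7.6 ± 0.5
               | 7.5 14.5 29.5   | 6.0 12.4 27.9   | 6.0 11.6 29.5   | 2.3 5.1 11.1
---------------+-----------------+-----------------+-----------------+----------------
CR+            | 29.3 ± 0.8      | 31.9 ± 1.1      | 32.5 ± 1.4      | 33.3 ± 0.3
               | 28.8 47.6 67.2  | 30.2 50.1 68.7  | 33.5 55.7 74.2  | 15.2 27.5 44.9
---------------+-----------------+-----------------+-----------------+----------------
Decision Tree  | 16.2 ± 0.1      | 19.9 ± 0.6      | 23.3 ± 1.2      | 22.7 ± 0.7
               | 23.3 33.5 47.0  | 28.9 41.6 52.7  | 36.9 50.7 63.2  | 13.5 20.2 28.4
---------------+-----------------+-----------------+-----------------+----------------
Least Squares  | 27.7 ± 0.3      | 32.5 ± 0.9      | 35.4 ± 0.8      | 34.9 ± 0.6
               | 30.1 48.4 66.8  | 33.7 53.2 71.6  | 41.7 62.9 77.6  | 17.1 29.8 46.6
---------------+-----------------+-----------------+-----------------+----------------
Mod LS Primal  | 28.3 ± 0.3      | 33.0 ± 0.9      | 35.7 ± 1.1      | 35.5 ± 0.5
               | 30.3 48.7 67.7  | 33.8 53.3 73.4  | 41.1 61.4 77.9  | 17.2 30.2 47.4
---------------+-----------------+-----------------+-----------------+----------------
Mod LS Dual    | 27.8 ± 0.3      | 32.9 ± 0.9      | 35.5 ± 1.2      | 35.2 ± 0.5
               | 29.9 48.7 67.1  | 34.6 53.4 73.1  | 41.8 62.3 77.6  | 17.4 30.0 46.7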

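Table 5. Results on dataset wine. For an explanation of the entries refer to Table 2. λ is set
at 0.001 for the linear methods.

algorithm      | Given2          | Given5          | Given10         | AllBut1
---------------+-----------------+-----------------+-----------------+----------------
Popular        | 15.3 ± 0.2      | 14.5 ± 0.2      | 14.2 ± 0.4      | 13.6 ± 0.8
               | 10.1 21.7 47.6  | 6.8 20.4 50.0   | 5.4 19.9 51.2   | 5.3 9.2 19.8
---------------+-----------------+-----------------+-----------------+----------------
CR+            | 23.7 ± 0.3      | 24.6 ± 0.4      | 26.7 ± 0.8      | 21.4 ± 0.5
               | 20.3 37.3 60.3  | 21.6 38.7 61.6  | 25.5 42.5 65.3  | 8.5 16.7 29.8
---------------+-----------------+-----------------+-----------------+----------------
Decision Tree  | 16.9 ± 0.4      | 18.6 ± 0.4      | 22.1 ± 0.5      | 17.9 ± 0.4
               | 16.8 27.8 49.0  | 18.9 33.9 52.8  | 21.4 41.3 60.1  | 7.6 14.3 24.1
---------------+-----------------+-----------------+-----------------+----------------
Least Squares  | 21.1 ± 0.3      | 24.7 ± 0.4      | 28.1 ± 0.3      | 22.2 ± 0.5
               | 19.4 35.8 57.1  | 23.5 43.2 63.8  | 25.9 46.3 68.1  | 8.8 17.5 31.3
---------------+-----------------+-----------------+-----------------+----------------
Mod LS Primal  | 20.8 ± 0.2      | 24.4 ± 0.4      | 27.8 ± 0.3      | 22.3 ± 0.4
               | 19.2 35.7 56.9  | 23.9 42.8 63.6  | 25.7 46.9 67.2  | 8.7 17.5 31.2
---------------+-----------------+-----------------+-----------------+----------------
Mod LS Dual    | 19.7 ± 0.2      | 23.4 ± 0.4      | 27.1 ± 0.5      | 21.8 ± 0.4
               | 18.0 34.0 55.6  | 23.1 41.8 62.9  | 26.6 46.2 66.9  | 8.6 17.3 30.2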

recorded using our prototype implementation on an IBM RISC System/6000, 
Model 43P-140 using a 332 MHz PowerPC 604e processor. These model build 
times can be compared to those reported in 0 for Bayesian networks (144.65 
seconds) and for dependency networks (98.31 seconds) on a 300 MHz Pentium 
system. However, our linear models are orders of magnitude more efficient than 
both Bayesian networks and decision networks jOI when generating recommen- 
dations because of the simplicity in computing the scores for a test user. 






The accuracy achieved on the public dataset msweb cannot be directly com- 
pared with the results in 0 and 0 because of random choices made in the 
given/hidden sets. The results in Tables 3, 4 and 5 suggest that linear least
squares and the primal and dual forms of the modified version fare well in
comparison with our implementation of CR+. For example, the linear (dual) model
is more accurate than CR+ in 11 out of the 12 experimental setups (3 datasets
with 4 scenarios in each) using the success in the top 3 metric. If we use the
R metric, the linear (dual) model beats CR+ in 8 out of the 12 experimental
setups.
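
The two counts quoted above can be checked with a short Python tally. The numbers are transcribed from the msweb, pageweb and wine result tables in the order Given2, Given5, Given10, AllBut1; the dictionary layout and the helper name `count_wins` are illustrative choices, not part of the original study.

```python
# "Success within top 3" percentages for CR+ and the modified least squares
# (dual) model, transcribed from the three result tables.
top3 = {
    "msweb":   {"CR+": [70.5, 68.3, 67.7, 54.8], "dual": [72.9, 75.6, 77.3, 59.0]},
    "pageweb": {"CR+": [47.6, 50.1, 55.7, 27.5], "dual": [48.7, 53.4, 62.3, 30.0]},
    "wine":    {"CR+": [37.3, 38.7, 42.5, 16.7], "dual": [34.0, 41.8, 46.2, 17.3]},
}

# Mean R metric (in percent) for the same two algorithms.
r_metric = {
    "msweb":   {"CR+": [56.7, 54.2, 51.5, 60.8], "dual": [55.2, 57.5, 56.7, 64.4]},
    "pageweb": {"CR+": [29.3, 31.9, 32.5, 33.3], "dual": [27.8, 32.9, 35.5, 35.2]},
    "wine":    {"CR+": [23.7, 24.6, 26.7, 21.4], "dual": [19.7, 23.4, 27.1, 21.8]},
}


def count_wins(table):
    """Number of (dataset, scenario) setups where the dual model beats CR+."""
    return sum(d > c
               for row in table.values()
               for d, c in zip(row["dual"], row["CR+"]))


wins_top3 = count_wins(top3)
wins_r = count_wins(r_metric)
```

The only top-3 setup in which CR+ stays ahead is Given2 on wine (37.3 versus 34.0).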

The impact of the regularization parameter λ on the accuracy of one of the
linear models (least squares) is shown in Figure 1. Similar behavior has been
observed for the other linear models. The choice of λ makes a non-negligible
difference for all algorithms. The value of this parameter could be chosen using
cross-validation experiments with the training data, though this was not done
in our study. The figures also suggest that in practice one may choose a fixed
λ with reasonable performance across a number of datasets, without any
cross-validation to select λ. We would like to mention that, for all algorithms,
the value of λ should be the same for every potential recommendation item.
Otherwise, the computed scores $w^T x$ will not be comparable across items. A
side effect is that we only have a single λ to determine for each algorithm.
Therefore a cross-validation procedure can be used to determine this value stably.
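
To make the role of comparable scores concrete, here is a minimal Python sketch of how recommendations could be ranked from per-item linear models. The weight vectors and the user vector are hypothetical; the sketch assumes one weight vector per candidate item, all trained with the same regularization parameter, and a binary user vector marking items already chosen.

```python
def recommend(weights, x, top_k=3):
    """Rank items not yet chosen by the user by their linear score w_j . x."""
    # One score per candidate item: the dot product of its weight vector with x.
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    # Only items the user has not selected yet are candidates.
    candidates = [j for j in range(len(weights)) if x[j] == 0]
    candidates.sort(key=lambda j: scores[j], reverse=True)
    return candidates[:top_k]


# Hypothetical models for four items (row j is the weight vector of item j).
weights = [
    [0.0, 0.4, 0.1, 0.0],
    [0.5, 0.0, 0.2, 0.1],
    [0.3, 0.6, 0.0, 0.2],
    [0.1, 0.1, 0.7, 0.0],
]
x = [1, 1, 0, 0]  # the user has already chosen items 0 and 1

ranking = recommend(weights, x, top_k=2)
```

Because generating a recommendation only needs one dot product per item followed by a sort, scoring a test user is cheap.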

It is interesting to observe that in this application, the standard least squares
method does as well as the more complicated modified least squares method. This
is not the case for many other classification problems. One explanation
could be that we use the ranking of the items, rather than a form of
classification error, to measure the quality of a recommendation system. The
impact of penalizing a point with $w^T x > 1$ in the formulation for this task
is clearly different from that in a standard classification task.

4 Conclusion 

This paper presents a model-based approach to recommender systems using lin- 
ear classification models. We focus on three different linear formulations and the 
corresponding algorithms for building the models. Experiments are performed 
with three datasets and recommendation accuracies compared using two differ- 
ent metrics. The experiments indicate that the linear models are more accurate 
than a memory-based collaborative filtering approach reported earlier. This im- 
proved accuracy in combination with the better computational characteristics 
makes these linear models very attractive for this application. 

Acknowledgments. We would like to thank Murray Campbell, Richard 
Lawrence and George Almasi for their help during this work. 
Appendix A. Details of Algorithms

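Algorithm 1 (Least Squares Primal)

let $w = 0$ and $r_i = 0$

for $k = 1, 2, \ldots$
  for $j = 1, \ldots, d$
    $\Delta w_j = -\left(\sum_i (r_i - 1) x_{ij} y_i + \lambda n w_j\right) / \left(\sum_i x_{ij}^2 + \lambda n\right)$
    update $r$: $r_i = r_i + \Delta w_j x_{ij} y_i$ $(i = 1, \ldots, n)$
    update $w$: $w_j = w_j + \Delta w_j$
  end
end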
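
The coordinate-descent loop of Algorithm 1 can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the update is taken to be the exact one-dimensional minimizer of $\sum_i (w^T x_i y_i - 1)^2 + \lambda n \|w\|^2$ along each coordinate, labels are assumed to be $\pm 1$, and the function names and toy data are hypothetical.

```python
def least_squares_primal(X, y, lam, sweeps=25):
    """Cyclic coordinate descent on sum_i (w.x_i*y_i - 1)^2 + lam*n*|w|^2.

    X is a list of n feature rows, y a list of +1/-1 labels, lam > 0.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    r = [0.0] * n  # r[i] tracks (w . x_i) * y_i, which is 0 while w = 0
    for _ in range(sweeps):
        for j in range(d):
            num = sum((r[i] - 1.0) * X[i][j] * y[i] for i in range(n)) + lam * n * w[j]
            den = sum(X[i][j] ** 2 for i in range(n)) + lam * n
            dw = -num / den  # exact minimizer along coordinate j (assumed form)
            for i in range(n):
                r[i] += dw * X[i][j] * y[i]
            w[j] += dw
    return w


def half_gradient(X, y, lam, w):
    """Half the gradient of the objective; it vanishes at the minimizer."""
    n, d = len(X), len(X[0])
    r = [sum(w[j] * X[i][j] for j in range(d)) * y[i] for i in range(n)]
    return [sum((r[i] - 1.0) * X[i][j] * y[i] for i in range(n)) + lam * n * w[j]
            for j in range(d)]


# Hypothetical toy data: 6 users, 3 binary features, labels in {+1, -1}.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 1, 1]]
y = [1, 1, -1, 1, -1, 1]
w = least_squares_primal(X, y, lam=0.1, sweeps=200)
```

Keeping the residual-like quantities `r` up to date after each coordinate step is what makes a sweep cost proportional to the size of the data, since no full recomputation of $w^T x_i$ is needed.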

Algorithm 2 (Mod Least Squares Primal)

let $w = 0$ and $r_i = 0$
pick a decreasing sequence $1 = c_1 > c_2 > \cdots > c_K = 0$

for $k = 1, 2, \ldots, K$
  define function $C_k(r_i) = 1$ if $r_i \le 1$ and $C_k(r_i) = c_k$ otherwise
  for $j = 1, \ldots, d$
    $\Delta w_j = -0.5 \left[\sum_i C_k(r_i)(r_i - 1) x_{ij} y_i + \lambda n w_j\right] / \left[\sum_i C_k(r_i) x_{ij}^2 + \lambda n\right]$
    update $r$: $r_i = r_i + \Delta w_j x_{ij} y_i$ $(i = 1, \ldots, n)$
    update $w$: $w_j = w_j + \Delta w_j$
  end
end



Algorithm 3 (Mod Least Squares Dual)

let $\zeta = 0$ and $v_j = 0$ for $j = 1, \ldots, d$

for $k = 1, 2, \ldots$
  for $i = 1, \ldots, n$

    update $v$: $v_j = v_j + \Delta\zeta_i x_{ij} y_i$
    update $\zeta$: $\zeta_i = \zeta_i + \Delta\zeta_i$
  end
end

let 



(j = 1, . . . , d) 
Abstract. With the rapid growth of e-commerce applications, Internet shopping
is becoming part of our daily lives. Traditional Web-based product searching
based on keyword searching seems insufficient and inefficient in the 'sea' of
information. In this paper, we propose an innovative intelligent multi-agent
based environment, namely iJADE (intelligent Java Agent Development
Environment), to provide an integrated and intelligent agent-based platform in
the e-commerce environment. In addition to contemporary agent development
platforms, which focus on the autonomy and mobility of the multi-agents,
iJADE provides an intelligent layer (known as the 'conscious layer') to
implement various AI functionalities in order to produce 'smart' agents. From
the implementation point of view, iJADE eMiner consists of two main modules:
1) a visual data mining and visualization system for automatic facial
authentication based on the FAgent model, and 2) a fuzzy-neural based
shopping agent (FShopper) to facilitate Web-mining on Internet shopping in
cyberspace.



1 Introduction 

Owing to the rapid development in e-commerce, ranging from C2C e-commerce 
applications such as e-auction to sophisticated B2B e-commerce activities such as e- 
Supply Chain Management (eSCM), the Internet is becoming a common virtual 
marketplace for us to do our business, search for information and communicate with 
each other. 

However, owing to these ever-increasing tons of information in cyberspace, 
information searching, or more precisely knowledge discovery and Web-mining is 
becoming the critical key to success at doing business in the cyberworld. Moreover, 
with the advance of PC computing technology in terms of computational speed and 
popularity, intelligent software applications known as agents, with their autonomous 
properties, automatic delegation of jobs, and highly mobile and adaptive behavior in 
the Internet environment, are becoming a potential area of development for e-business 
in the new millennium [12]. 

In a typical e-shopping scenario, there are two fundamental aspects of functionality 
in which Web-mining and visual data mining might help. The first one is customer 
authentication. Traditional authentication, based on username and password over a 
security transport layer such as SSL protocol, although providing a secured user 
authentication scheme, requires customer pro-active login to grant the access right, 
which may discourage the customer from his or her shopping intention. Other 
authentication schemes based on digital certificate with smart card technology [22], or 
biometric authentication techniques based on iris or palm recognition, might provide
some alternatives of automatic authentication scheme. However, they all need special 
authentication equipment which limits usability in the e-commerce environment, not 
to mention the legal implications of accessing personal privacy data such as iris and 
palm patterns. In contrast, automatic authentication based on human face recognition 
can get rid of all these limitations. In terms of visual processing equipment, the 
standard Web-camera is already good enough for facial pattern extraction, which is 
nowadays more or less standard equipment for Web browsing. Moreover, this kind of 
authentication scheme can provide a truly automatic scheme in which the customer 
does not need to provide any special identity information, and more importantly it 
does not need to explore any ’confidential’ or ’sensitive’ data such as fingerprints and 
iris patterns. 

The other area is the automation of the online shopping process via agent 
technology. Traditional shopping models include Consumer Buying Behavior Models 
such as the Blackwell model [7], the Bettman model [3] and the Howard-Sheth model 
[11], which all share a similar list of six fundamental stages of consumer buying 
behavior: 1) consumer requirement definition, 2) product brokering, 3) merchant 
brokering, 4) negotiation, 5) purchase and delivery, and 6) after-sale services and 
evaluation. In reality, the first three stages in the consumer buying behavior model 
involve a wide range of uncertainty and possibilities - or what we called ’fuzziness’ - 
ranging from the setting of buying criteria and provision of products by the merchant, 
to the selection of goods. So far these are all ’grey areas’ that we need to thoroughly 
explore in order to apply agent technology to the e-commerce environment. 

In this paper, we propose an integrated intelligent agent-based framework, known
as iJADE (pronounced 'IJ') - Intelligent Java Agent-based Development
Environment. To address the deficiency of contemporary agent software
platforms such as IBM Aglets [1] and ObjectSpace Voyager Agents [25], which
mainly focus on multi-agent mobility and communication, iJADE provides an
ingenious layer called the 'Conscious (Intelligent) Layer', which supports different AI
functionalities for multi-agent applications. From the implementation point of
view, we will demonstrate one of the most important applications of iJADE in the
e-commerce environment - iJADE eMiner. iJADE eMiner is a truly intelligent
agent-based Web-mining application which consists of two major modules: 1) an
agent-based facial pattern authentication scheme - FAgent, an innovative visual
data mining and visualization scheme using the EGDLM technique [14]; 2) a
Web-mining tool based on fuzzy shopping agents for product selection and
brokering - FShopper. This paper is organized as follows. Section 2 presents an
overview of Web-mining, a vital extension of data mining in cyberspace. Section 3
presents the model framework of iJADE, and the two major components of iJADE
eMiner: FAgent and FShopper. System implementation will be discussed in
section 4, which is followed by a brief conclusion.



2 Web Mining - A Perspective 

As an important extension of data mining [8], Web-mining is an integrated 
technology of various research fields including computational linguistics, statistics, 
informatics, artificial intelligence (AI) and knowledge discovery [25]. It can also be 
interpreted as the discovery and analysis of useful information from the Web. 
Although there is no definite principle of the Web-mining model, basically it can be
categorized into two main areas: Content Mining and Structural Mining [6] [25]. The 
taxonomy of Web-mining is depicted in Fig. 1. 



Web Mining
  Content Mining
    - Intelligent search engine (text-based)
    - Visual mining (e.g. FAgent)
    - Web product mining (e.g. FShopper)
    - Web query systems
    - Multi-layer database mining
  Structural Mining
    - External structure mining
    - Internal structure mining
    - URL mining
    - Web usage mining

Fig. 1. A taxonomy of Web-mining

’Content Mining’ focuses on the extraction and knowledge discovery (mining) of 
the Web content, ranging from the HTML, XML documents found in the Web servers 
to the mining of data and knowledge from the data source (e.g. databases) attached to 
the backend of Web systems. On the other hand, ’Structural Mining’ focuses on 
knowledge discovery for the structure of the Web system, including the mining of the 
user preferences on Web browsing (Web usage mining), the usage of the different 
URLs in a particular Website (URL mining), external structure mining (for hyperlinks 
between different Web pages) and internal structure mining (within a particular Web 
page). Active research includes Spertus et al. [24] on internal structure mining and 
Pitkow [21] on Web usage mining. 

With regard to 'Content Mining', classical search engines such as Lycos,
WebCrawler, Infoseek and Alta Vista do provide some sort of searching aid.
However, they fail to provide concrete and structural information [19]. In recent
years, interest has been focused on how to provide a higher level of organization for 
semi-structured and even unstructured data on the Web using Web-mining techniques. 
Basically, there are two main research areas: the Database Approach and the Agent- 
based Approach. The Database Approach Web-mining tries to focus on the 
techniques for organizing the semi-structured / unstructured data on the Web into 
more structured information and resources, such that traditional query tools and data 
mining can be applied for data analysis and knowledge discovery. Typical examples 
can be found in the ARANEUS system [2] using a multi-level database for Web- 
mining, and Dunren et al. [5] using a complex Web query system for the Web-mining 
of G-Protein Coupled Receptors (GPCRs). 

The Agent-based Approach focuses on the provision of ’intelligent’ and 
’autonomous’ Web-mining tools based on agent technology. Typical examples can be 
found in FAQFinder [9] for intelligent search engines and Firefly [23] for
personalized Web agents. 

In our proposed iJADE intelligent agent model, two innovative Web-mining 
applications from the iJADE eMiner are introduced: 1) FAgent, an automatic visual 
data mining agent based on the EGDLM (Elastic Graph Dynamic Link Model [16]) 
for automatic authentication based on invariant human face recognition, and 2) 
FShopper, a Web-based product mining application using fuzzy-neural shopping
agents. 

3 iJADE Architecture 

3.1 iJADE Framework: ACTS Model 

In this paper, we propose a fully integrated intelligent agent model called iJADE
(pronounced 'IJ') for intelligent Web-mining and other intelligent agent-based
e-commerce applications. The system framework is shown in Fig. 2.




Unlike contemporary agent systems and APIs such as IBM Aglets [1] and 
ObjectSpace Voyager [25], which focus on the multi-agent communication and 
autonomous operations, the aim of iJADE is to provide comprehensive ’intelligent’ 
agent-based APIs and applications for future e-commerce and Web-mining 
applications. 

Fig. 2 depicts the two-level abstraction in the iJADE system: a) iJADE system
level - ACTS model, and b) iJADE data level - DNA model. The ACTS model
consists of 1) the Application Layer, 2) the Conscious (Intelligent) Layer, 3) the
Technology Layer, and 4) the Supporting Layer. The DNA model is composed of 1)
the Data Layer, 2) the Neural Network Layer, and 3) the Application Layer.

Compared with contemporary agent systems which provide minimal and
elementary data management schemes, the iJADE DNA model provides a
comprehensive data manipulation framework based on neural network technology.
The 'Data Layer' corresponds to the raw data and input stimuli (such as the facial
images captured from the Web camera and the product information in the cyberstore)
from the environment. The 'Neural Network Layer' provides the 'clustering' of
different types of neural networks for the purpose of 'organization', 'interpretation',
'analysis' and 'forecasting' operations based on the inputs from the 'Data Layer', which
are used by the iJADE applications in the 'Application Layer'.

Another innovative feature of the iJADE system is the ACTS model, which
provides a comprehensive layered architecture for the implementation of intelligent
agent systems, as explained in the following sections.

3.2 Application Layer Including iJADE eMiner 

This is the uppermost layer that consists of different intelligent agent-based 
applications. These iJADE applications are developed by the integration of intelligent 
agent components from the ’Conscious Layer’ and the data ’knowledge fields’ from the 
DNA model. 

Current applications (iJADE v1.0) implemented in this layer include:

■ MAGICS (Mobile Agent-based Internet Commerce System), a collection of 
intelligent agent-based e-commerce applications including the MAGICS shopper 
(Internet shopping agent), and the MAGICS auction (intelligent auction agent) [4]. 

■ iJADE eMiner, the intelligent Web-mining agent system proposed in this paper. It
consists of the implementation of 1) FAgent, an automatic authentication system
based on human face recognition, and 2) FShopper, a fuzzy agent-based Internet
shopping agent.

■ iJADE WeatherMAN, an intelligent weather forecasting agent which is the 
extension of previous research on multi-station weather forecasting using fuzzy 
neural networks [20]. Unlike traditional Web-mining agents, which focus on the 
automatic extraction and provision of the latest weather information, iJADE 
WeatherMAN possesses neural network-based weather forecasting capability (AI 
services provided by the ’Conscious Layer’ of the iJADE model) to act as a ’virtual’ 
weather reporter as well as an ’intelligent’ weather forecaster for weather 
prediction. 

■ iJADE WShopper, an integrated intelligent fuzzy shopping agent with WAP 
technology for intelligent mobile shopping on the Internet. 



3.3 Conscious (Intelligent) Layer 

This layer provides the intelligent basis of the iJADE system, using the agent
components provided by the 'Technology Layer'. The 'Conscious Layer' consists of
the following three main intelligent functional areas:

1) 'Sensory Area' - for the recognition and interpretation of incoming stimuli. It
includes a) visual sensory agents using EGDLM (Elastic Graph Dynamic Link
Model) for invariant visual object recognition [16][17][18], and b) auditory
sensory agents based on wavelet-based feature extraction and interpretation
techniques [10].

2) 'Logic Reasoning Area' - conscious area providing different AI tools for logical
'thinking' and rule-based reasoning, such as fuzzy and GA (Genetic Algorithms)
rule-based systems [15].

3) 'Analytical Area' - consists of various AI tools for analytical calculation, such as
recurrent neural network-based analysis for real-time prediction and data mining
[13].

3.4 Technology Layer Using IBM Aglets and Java Servlets 

This layer provides all the necessary mobile agent implementation APIs for the 
development of intelligent agent components in the ’Conscious Layer’. 

In the current version (v1.0) of the iJADE model, IBM Aglets [1] are used as the
agent ’backbone’. The basic functionality and runtime properties of Aglets are defined 
by the Java Aglet, AgletProxy and AgletContext classes. The abstract class Aglet 
defines the fundamental methods that control the mobility and lifecycle of an aglet. It 
also provides access to the inherent attributes of an aglet, such as creation time, 
owner, codebase and trust level, as well as dynamic attributes, such as the arrival time 
at a site and address of the current context. 

The main function of the AgletProxy class is to provide a handle that is used to 
access the aglet. It also provides location transparency by forwarding requests to 
remote hosts and returning results to the local host. Actually, all communication with 
an aglet occurs through its aglet proxy. The AgletContext class provides the runtime 
execution environment for aglets within the Tahiti server. Thus when an aglet is 
dispatched to a remote site, it is detached from the current AgletContext object, 
serialized into a message bytestream, sent across the network, and reconstructed in a 
new AgletContext, which in turn provides the execution environment at the remote 
site. The other critical component of the Aglet environment is the security issue. 
Aglets provide a security model in the form of an AgletSecurityManager, which is a 
subclass of the “standard” Java SecurityManager. 
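
The dispatch sequence described above (detach, serialize, transmit, reconstruct) can be illustrated with a small language-neutral sketch in Python. This is not the Aglets API: the `Context` class, its methods and the JSON encoding are hypothetical stand-ins chosen only to show the order of the steps.

```python
import json


class Context:
    """Hypothetical stand-in for an execution environment hosting agents."""

    def __init__(self, name):
        self.name = name
        self.agents = {}

    def attach(self, agent_id, state):
        self.agents[agent_id] = state

    def dispatch(self, agent_id, remote):
        state = self.agents.pop(agent_id)               # detach from the current context
        payload = json.dumps(state).encode("utf-8")     # serialize into a byte stream
        # In a real system the bytes would now be sent across the network.
        restored = json.loads(payload.decode("utf-8"))  # reconstruct at the remote site
        remote.attach(agent_id, restored)
        return len(payload)


client = Context("client")
server = Context("server")
client.attach("messenger", {"owner": "alice", "hops": 0})
sent_bytes = client.dispatch("messenger", server)
```

After the call the agent state exists only in the receiving context, which mirrors the idea that the new context provides the execution environment at the remote site.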

In this layer, server-side computing using Java Servlet technology is also adopted.
For certain intelligent agent-based applications such as WShopper, the WAP
devices (e.g. WAP phones) provide only limited resources (in terms of memory and
computational speed), so all the iJADE agent interactions are invoked in the
'backend' WAP server using Java Servlet technology.

3.5 Supporting Layer 

This layer provides all the necessary system supports to the 'Technology Layer'. It 
includes 1) Programming language support based on Java, 2) Network protocols 
support such as HTTP, HTTPS, ATP, etc., and 3) Markup languages support such as 
HTML, XML, WML, etc. 



4 Implementation 

4.1 iJADE eMiner: Overview 

In this paper, an iJADE eMiner for intelligent e-shopping is introduced. This system 
consists of two modules: 1) Visual data mining for intelligent user authentication 
module based on FAgent, and 2) Web product mining using neuro-fuzzy shopping 
agent based on FShopper. 
Intelligent Visual Data Mining for User Authentication Using FAgent

In short, there are three kinds of intelligent agents operating within the FAgent
system. They are 1) the FAgent Feature Extractor, a stationary agent situated within
the client machine to extract the facial features from a facial image which is captured
from the client's digital camera, 2) the FAgent Messenger, a mobile agent which acts
as a messenger that on one hand "carries" the facial image to the server-side agent
and on the other hand "reports" the latest recognition results back to the client
machine, and 3) the FAgent Recognizer, a stationary agent situated within the server
(e.g. virtual shopping mall) using EGDLM (Elastic Graph Dynamic Link Model) for
invariant facial recognition - an innovative object recognition scheme based on neural
networks and elastic graph matching techniques, which has shown promising results in
human face recognition [16], scene analysis [18] and tropical cyclone identification
[17]. Its main duty is to perform automatic and invariant facial pattern matching
against the server-side facial database. A schematic diagram of the FAgent is shown
in Fig. 3.




Fig. 3. Schematic diagram of FAgent System 

Neural-Fuzzy Agent-Based Shopping Using FShopper 

The system framework of FShopper consists of the following six main modules: 

1. Customer requirement definition (CRD)

2. Requirement fuzzification scheme (RFS) 

3. Fuzzy agents negotiation scheme (FANS) 

4. Fuzzy product selection scheme (FPSS) 

5. Product defuzzification scheme (PDS) 

6. Product evaluation scheme (PES) 

A schematic diagram of the Fuzzy Shopper is shown in Fig. 4. 
Fig. 4. Schematic diagram of FShopper



4.2 Experimental Results 
FAgent Test 

In the experiment, 100 human subjects were used for system training. A set of 1,020
test patterns resulting from different facial expressions, viewing perspectives, and
sizes of stored templates was used for testing. The test facial patterns were
obtained with a CCD camera providing a standard video signal, and digitized at 512 x
384 pixels with 8 bits of resolution.

FAgent Test I: Viewing Perspective Test 

In this test, a viewing perspective ranging from -30° to +30° (with reference to the
horizontal and vertical axes) was adopted, using 100 test patterns for each viewing
perspective. Recognition results are presented in Table 1.



Table 1. Results for viewing perspective test

Viewing perspective  | Correct        | Viewing perspective  | Correct
(from horiz. axis)   | classification | (from vertical axis) | classification
---------------------+----------------+----------------------+---------------
+30°                 | 84%            | +30°                 | 86%
+20°                 | 90%            | +20°                 | 88%
+10°                 | 92%            | +10°                 | 91%
-10°                 | 91%            | -10°                 | 92%
-20°                 | 89%            | -20°                 | 87%
-30°                 | 85%            | -30°                 | 82%

According to the “Rotation Invariant” property of the EGDLM model [16] [18], the
FAgent possesses the same characteristic in the “contour maps elastic graph 
matching” process. An overall correct recognition rate of over 86% was achieved. 

FAgent Test II: the Facial Pattern Occlusion and Distortion Test 

In this test, the 120 test patterns are basically divided into three categories: 

■ Wearing spectacles or other accessories 

■ Partial occlusion of the face by obstacles such as cups / books (in reading and 
drinking processes) 

■ Various facial expressions (such as laughing, angry and gimmicky faces). 

Pattern recognition results are shown in Table 2. 



Table 2. Recognition results for occlusion / distortion test

Pattern Occlusion & Distortion Test                           | Correct classification
--------------------------------------------------------------+-----------------------
Wearing spectacles (or other accessories)                     | 87%
Face partially hidden by obstacle (e.g. books, cups)          | 72%
Facial expressions (e.g. laughing, angry and gimmicky faces)  | 83%



Among the three categories of facial occlusion and distortion, "wearing
spectacles" has the least negative effect on facial recognition, owing to the fact
that all the main facial contours are still preserved in this situation. In the second
situation, the effect on the recognition rate depends on how much and which
portion of the face is obscured. Nevertheless, the average correct recognition
rate was found to be over 73%.

Facial expressions and gimmicky faces gave the most striking results. Owing to the
"Elastic Graph" characteristic of the model, the recognition engine "inherited" the
"Distortion Invariant" property and an overall correct recognition rate of 83% was
attained.

The FShopper Test 

For the product database, over 200 items under eight categories were used to
construct the e-catalog. These categories were: T-shirt, shirt, shoes, trousers, skirt,
sweater, tablecloth, napkins. We deliberately chose soft-goods items instead of
hard goods such as books or music CDs (as commonly found in most e-shopping agent
systems), so that there would be more room for fuzzy user requirement definition and
product selection. For neural network training, all the e-catalog items were
'pre-trained' in the sense that we had pre-defined the attribute descriptions for all
these items to be 'fed' into the fuzzy neural network for product training (for each
category). In total, eight different neural networks were constructed, one for each
category of product.

From the experimental point of view, two sets of tests were conducted: the Round
Trip Time (RTT) test and the Product Selection (PS) test. The RTT test aims at an
evaluation of the 'efficiency' of the FShopper in the sense that it measures the
whole round trip time of the agents, instead of only the difference between the
arrival and departure times to/from the server. The RTT test records all the
"component" time fragments, from the collection of the user requirement and the
fuzzification process to the product selection and evaluation steps, so that a
complete picture of the performance efficiency can be deduced. In the PS test, since
there was no definite answer to whether a product would 'fit' the taste of the
customer or not, a sample group of 40 candidates was used to judge the
'effectiveness' of the FShopper. Details are given in the following sections.

FShopper Test I: The Round Trip Time (RTT) test 

In this test, two iJADE servers were used: T1server and T2server.
T1server was situated within the same LAN as the client machine, while T2server
was at a remote site (within the campus). Results of the mean RTT after 100 trials for
each server are shown in Table 3.

As shown in Table 3, the total RTT is dominated by the Fuzzy Product Selection
Scheme (FPSS), but the time spent is still within an acceptable timeframe: 5 to 7
seconds. Besides, the difference in RTT between the server situated in the same LAN
and the one at the remote site was not significant except in the FANS, where the
Fuzzy Buyer needs to take a longer "trip" to the remote server. Of course, in
reality, it depends heavily on the network traffic.



Table 3. Mean RTT summary after 100 trials

Time (msec)          | T1server            | T2server
---------------------+---------------------+----------------------------
Server location      | Same LAN as client  | Remote site (within campus)
A. Client machine    |                     |
   CRD               | -                   | -
   RFS               | 310                 | 305
B. Server machine    |                     |
   FANS              | 320                 | 2015
   FPSS              | 4260                | 4133
C. Client machine    |                     |
   PDS               | 320                 | 330
   PES               | 251                 | 223
TOTAL RTT            | 5461                | 7006
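
The totals in Table 3 and the statement that FPSS dominates the round trip can be checked with a few lines of Python. The stage timings are transcribed from the table (CRD has no recorded time); the last client-side stage is labelled PES here on the assumption that it is the product evaluation scheme, and the dictionary layout is an illustrative choice.

```python
# Mean stage timings in milliseconds, transcribed from Table 3.
rtt_ms = {
    "T1server": {"RFS": 310, "FANS": 320, "FPSS": 4260, "PDS": 320, "PES": 251},
    "T2server": {"RFS": 305, "FANS": 2015, "FPSS": 4133, "PDS": 330, "PES": 223},
}

# Total round trip time per server is the sum of its stage timings.
totals = {server: sum(stages.values()) for server, stages in rtt_ms.items()}

# Fraction of the total round trip spent in fuzzy product selection.
fpss_share = {server: rtt_ms[server]["FPSS"] / totals[server] for server in rtt_ms}
```

The sums reproduce the reported totals of 5461 ms and 7006 ms, and FPSS accounts for more than half of the round trip on both servers.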



FShopper Test II: The Product Selection (PS) test 



Unlike the RTT test, in which objective figures can be easily obtained, the PS test
results rely heavily on user preference. In order to get a more objective result, a
sample group of 40 candidates was invited for system evaluation. In the test, each
candidate would 'buy' one product from each category according to his or her own
requirements. For evaluation, each candidate would browse the e-catalog to choose a
list of the 'best five choices' (L) which 'fit his/her taste'. By comparison with the
'top five' recommended product items (i) given by the fuzzy shopper, the 'Fitness
Value (FV)' is calculated as follows:

$$FV = \frac{\sum_{n=1}^{5} n \times i}{15}, \quad \text{where } i = \begin{cases} 1 & \text{if } i \in L \\ 0 & \text{otherwise} \end{cases}$$
In the calculation, scores of 5 to 1 were given to 'correct matches' of the candidate's
first to fifth 'best five' choices with the FShopper's suggestions. For example, if, out
of the five 'best choices' selected by the customer, the products of rank no. 1, 2, 3
and 5 appear in the fuzzy shopper's recommended list, the fitness value will be 73%,
which is the sum of 1, 2, 3 and 5 divided by 15. Results for the eight different
categories are shown in Table 4.
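As a minimal sketch of the fitness value formula, the function below takes the five match indicators and applies the sum-over-15 rule. The indexing of the indicators by position n and the function name are assumptions made for illustration; the worked example from the text (matches valued 1, 2, 3 and 5) is reproduced.

```python
def fitness_value(indicators):
    """FV = sum(n * i_n for n = 1..5) / 15.

    indicators[n - 1] is 1 when the n-th item is a match and 0 otherwise
    (this positional indexing is an assumption made for illustration).
    """
    if len(indicators) != 5:
        raise ValueError("exactly five indicators are expected")
    return sum(n * i_n for n, i_n in enumerate(indicators, start=1)) / 15


# Worked example from the text: matches carrying the values 1, 2, 3 and 5.
fv = fitness_value([1, 1, 1, 0, 1])
print(f"{fv:.0%}")  # (1 + 2 + 3 + 5) / 15 = 11 / 15, printed as 73%
```

A full match of all five items gives 15/15, i.e. a fitness value of 100%.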

It is not difficult to predict that the performance of the FShopper is highly
dependent on the 'variability' (or 'fuzziness') of the merchandise. The higher the
fuzziness (which means more variety), the lower the score. As shown in Table 4,
skirts and shoes are typical examples: skirts score 65% and shoes score 89%.
Nevertheless, the average score is still over 81%. Note that these figures are
only for illustration purposes, as human judgment and product variety in actual
scenarios vary case by case.



Table 4. Fitness value for the eight different product categories

Product category     Fitness Value % (FV)
T-shirt              81
Shirt                78
Shoes                89
Trousers             88
Skirts               65
Sweater              81
Tablecloth           85
Napkins              86
Average score        81.6 %



5 Conclusion 

In this paper, an innovative intelligent agent-based Web-mining application, iJADE
eMiner, is proposed. It is based on the integration of neuro-fuzzy Web-mining
technology (FShopper) and intelligent visual data-mining technology (FAgent) for
automatic user authentication. It will hopefully open a new era of Web-based data
mining and knowledge discovery using intelligent agent-based systems.
Abstract. In recent years, due to the rapid growth of Internet usage, the problem
of how to avoid access to inappropriate Internet content has become more and more
important. To solve the problem, a Collaborative Rating System [3, 4] based upon
the PICS protocol has been proposed. However, users usually would like to consult
the opinions of the user group with a similar rating tendency rather than the
common opinion of the majority, so the opinion of the second majority, given a
sufficient number of voters, should also be considered, as should that of the third
majority, and so on. In order to provide a characterized rating service, a
Characterized Rating Recommend System is designed to provide a more precise and
proper rating service for each user. Also, in this work, a questionnaire is designed
to collect users' opinions, and experimental results show that the system can
provide an acceptable rating service.



1 Introduction 

In recent years, due to the rapid growth of Internet usage, the problem of how to avoid
access to inappropriate Internet content has become more and more important. The
concept of content selection was proposed to solve this problem, and there is much
previous research on content selection; e.g., the PICS [5] protocol proposed by
W3C [8]. But some problems remain; e.g., the problem of collecting rating
information. To solve this problem, a Collaborative Rating System [3, 4] has been
proposed. However, users' rating tendencies cannot be considered in that system. In
order to provide a Characterized Rating Service which takes care of this problem, in
this work the opinions of participants are first represented as well-structured data,
Rating Vectors, and these Rating Vectors are clustered into Rating Groups
corresponding to different rating opinions. Then the properties of each Rating Group
are mined by using a Rating Decision Tree Constructing Algorithm. To prevent the
problem of over-fitting in a decision tree, a Precision and Support Based Decision
Pruning Algorithm is applied. Finally, the generated rules about Rating Groups are
used to provide users with characterized rating services. Based on these concepts, a
Characterized Rating Recommend System is designed. In the experiment, a
questionnaire is designed to efficiently collect opinions about content rating. 700
participants were asked to answer the questionnaire, and 616 of the completed
questionnaires are usable in the clustering without missing data. Some experimental
results with cross validation are also shown in this paper.



2 Related Works 

In order to solve the problem of how to avoid inappropriate Internet content, much
research has been proposed [1, 3, 4, 7]. Among these works, the Collaborative Rating
System [3, 4], based upon the PICS [5] protocol, provides a practical solution. The
Characterized Rating Recommend System is proposed to make the Collaborative Rating
System more adaptive to different user requirements.



2.1 PICS, Platform for Internet Content Selection [5] 

To solve the problem of selecting appropriate or desired content via the Internet,
much research has been proposed. Among this research, the PICS [5] protocol was
proposed by the W3C [8] organization; it provides a systematic architecture for a
document rating system, as well as methods for collecting rating information.
In the PICS protocol, the rating information is provided by two methods, self-labeling
and third-party labeling. In the self-labeling method, the rating information is
provided by the content providers of each web page. In third-party labeling, the
rating information is provided by specific groups or organizations instead of the
content provider. However, these methods of collecting rating information seem to be
too weak: there is no obligation for content providers to provide the rating, it is
impossible for a few voluntary or non-profit organizations to rate all documents, and
it is hard to design an acceptable automatic rating system.



2.2 Collaborative Rating System [3, 4] 

To solve the issues of collecting rating information, a Collaborative Rating System
[3, 4] was proposed, which collects rating information with the help of a large
number of volunteers. In the Collaborative Rating System, participants are asked to
rate web content as they browse web pages according to a selected rating category,
and their ratings are collected and used to derive a more objective result. The
attributes of the collected rating data consist of:



Category    Web Page    User    Level



In the collected rating data, Category represents the selected rating category of
this rating, and Web Page indicates the address of the target web page. The User
attribute records who made the rating, and the Level attribute is the level the user
assigned in the selected rating category. The rating data of the same web page





A Characterized Rating Recommend System 43 



collected from a large number of users will be used to derive the resulting rating
levels for each web page. The Collaborative Rating System provides a more practical
method for collecting rating information, and reduces the effort required of each
volunteer and organization to construct a real non-profit rating system.



3 Characterized Rating Recommend System 



Since different users may hold opinions that differ from the majority's, a single
rating result may not satisfy the needs of all users. In this section, a Characterized
Rating Recommend System is designed to provide recommendations on document
rating according to the characteristics of users in a collaborative rating system. The
architecture of the Characterized Rating Recommend System is shown in Fig. 1.




[Figure: rating data and questionnaire results pass through Data Preprocessing to form
the Rating Vectors of users u1, u2, u3, ..., un.]

Fig. 1. The architecture of Characterized Rating Recommend System



3.1 Rating Data Preprocessing 

In a Collaborative Rating System [3, 4], volunteers provide ratings for the web pages
they browse. The ratings a user has made can be structured into a well-formatted
Rating Vector that represents his/her rating opinion. A Rating Vector consists of the
ratings a user gave to specific web pages which have been browsed by many
participants. Differences between the rating vectors of participants reflect
differences between their rating opinions, and this property can be used to cluster
different user groups. For k given popular web pages, the participants who have rated
all these pages are selected and their information is analyzed in the system. The
rating vectors of these participants are constructed by arranging each participant's
ratings for these k web pages in a specific order.
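The preprocessing step above can be sketched as follows. This is a minimal illustration, assuming rating records arrive as (user, page, level) triples; the record layout and the function name are hypothetical and not taken from the system itself.

```python
def build_rating_vectors(ratings, pages):
    """Build one Rating Vector per participant who rated all k popular pages.

    ratings: iterable of (user, page, level) triples (assumed record layout).
    pages:   the k popular web pages, in the fixed order used for every vector.
    """
    by_user = {}
    for user, page, level in ratings:
        by_user.setdefault(user, {})[page] = level

    vectors = {}
    for user, rated in by_user.items():
        # Keep only participants who have rated every one of the k pages.
        if all(page in rated for page in pages):
            vectors[user] = [rated[page] for page in pages]
    return vectors


sample = [
    ("u1", "a.html", 0), ("u1", "b.html", 3),
    ("u2", "a.html", 1),                      # u2 did not rate b.html
    ("u3", "b.html", 2), ("u3", "a.html", 4),
]
print(build_rating_vectors(sample, ["a.html", "b.html"]))
```

Participants with a missing rating are dropped, which mirrors the selection of participants who have rated all k pages.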

3.2 Rating Group Clustering and Selecting

As defined above, each element of a participant's rating vector is the rating he/she
has made for a specific web page. Assume that the rating options provided are
arranged in order of meaning, for example, from harmless to dangerous. Then the
numeric difference in the same element of the Rating Vectors of different users
corresponds to the difference of their opinions about the web page, and the
difference between Rating Vectors can be easily defined on this basis.

We can then apply a clustering algorithm to the Rating Vectors to find the rating
groups representing different rating opinions. The K-means algorithm [2] is used here
to cluster the Rating Vectors, and only those clusters that include more participants
than a threshold are selected. After clustering and selecting, the Rating Vectors
are grouped into several different Rating Groups.
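The clustering and selection step can be sketched as below. This is a simplified k-means written with the standard library only; the squared Euclidean distance, the deterministic choice of the first k vectors as initial centroids, and the function names are assumptions for illustration, not details given in the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two rating vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=20):
    """Return the cluster index assigned to each vector."""
    centroids = [list(v) for v in vectors[:k]]  # simple deterministic start
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def select_rating_groups(vectors, k, threshold):
    """Cluster the vectors and keep clusters with more members than threshold."""
    groups = {}
    for index, cluster in enumerate(k_means(vectors, k)):
        groups.setdefault(cluster, []).append(index)
    return [members for members in groups.values() if len(members) > threshold]


vectors = [[1, 1], [5, 5], [1, 2], [5, 4], [2, 1]]
print(select_rating_groups(vectors, k=2, threshold=2))
```

With these five toy vectors, only the cluster around the low ratings has more than two members, so it is the only Rating Group kept.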



3.3 Rating Group Character Analyzing 

Each Rating Vector corresponds to a unique participant, and a Rating Group can be
thought of as a group of users with a similar rating tendency. It seems that some
common properties of the users' characteristics in a Rating Group can be analyzed. A
symbolic learning approach tries to derive common characteristics from the
characteristics of the users in the Rating Groups. Among symbolic learning algorithms,
the decision tree algorithm [6] is used because of its flexibility in handling training
data containing both symbolic and numeric data. To apply the decision tree algorithm,
the participants of different Rating Groups are treated as different classes of
samples, and the characteristics of the participants are used for decision tree
learning. After applying the algorithm, the constructed model can be further used to
analyze the properties of users in different Rating Groups.



3.4 Precision and Support Based Decision Tree Pruning 

In order to prevent the problem of over-fitting and to find the overall trend relating
users' characteristics to their rating tendencies, not only the precision but also the
support of rules should be considered. In order to generate rules with more support, a
Precision and Support Based Decision Tree Pruning Algorithm is proposed in this
section, and the details of the algorithm are shown in Algorithm 1.

Notations:

E(n)   The formula to evaluate the expected error ratio below node n in a tree.

Su(n)  The formula to evaluate the expected support ratio below node n in a tree.

G      Gain value of a node, the sum of E( ) and Su( ) of the node.

Gsum   Sum of the Gain values of all child nodes.

3.5 Rating Recommend Module

From the pruned decision tree, the rules about Rating Groups are generated [6],
and the rules are used for rating recommendation. After all the rules of the Rating
Groups are generated, the users of the rating service can be partitioned into different
groups, and so can the rating information. For a user of the rating service, a
recommendation can be made based on his/her characteristics and the corresponding
rating information, by evaluating the ratings of the same group. With this mechanism,
users may get rating information which is better adapted to their opinions, since the
rating information comes from users who have characteristics similar to theirs.
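A minimal sketch of the recommendation step is given below. It assumes that each rule is a predicate over a user's attributes paired with a Rating Group label, and that "evaluating ratings of the same group" means taking the most common level; both the rule representation and the majority vote are illustrative assumptions, and the attribute name `age` is hypothetical.

```python
from collections import Counter


def assign_group(user, rules):
    """Return the group of the first rule whose predicate holds for the user."""
    for predicate, group in rules:
        if predicate(user):
            return group
    return None


def recommend_level(user, page, rules, ratings):
    """Recommend the most common level given to `page` by the user's group.

    ratings: list of (group, page, level) triples from earlier raters.
    """
    group = assign_group(user, rules)
    levels = [lvl for g, p, lvl in ratings if g == group and p == page]
    if not levels:
        return None  # no rating information available for this group and page
    return Counter(levels).most_common(1)[0][0]


# Hypothetical rules and rating data, for illustration only.
rules = [(lambda u: u["age"] < 18, "A"), (lambda u: True, "B")]
ratings = [("A", "p.html", 3), ("A", "p.html", 3), ("A", "p.html", 1),
           ("B", "p.html", 0)]
print(recommend_level({"age": 15}, "p.html", rules, ratings))
```

Two users with different attributes can therefore receive different recommended levels for the same page, which is the point of the characterized service.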

Algorithm 1: Precision and Support Based Decision Tree Pruning Algorithm

For each non-leaf node p in Decision Tree from the bottom of Decision Tree
    Gsum = 0;
    For each sub node b of p
        G = Su(b) + E(b);
        Gsum = Gsum + G;
    endFor
    If Gsum > Su(p) + E(p) then
        Prune the node p from the Decision Tree;
    endIf
endFor
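Algorithm 1 can be sketched in Python as follows. The functions `E` and `Su` are passed in by the caller because their formulas are not given here, and interpreting "prune the node p" as collapsing p into a leaf is an assumption made for this sketch; the `Node` class is likewise hypothetical.

```python
class Node:
    """A decision tree node with a name and a list of child nodes."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def prune(node, E, Su):
    """Bottom-up pass of Algorithm 1; returns the names of pruned nodes."""
    pruned = []
    for child in node.children:
        pruned.extend(prune(child, E, Su))  # handle deeper nodes first
    if node.children:  # non-leaf node p
        g_sum = sum(Su(b) + E(b) for b in node.children)
        if g_sum > Su(node) + E(node):
            node.children = []  # assumption: p is collapsed into a leaf
            pruned.append(node.name)
    return pruned


# Hypothetical error and support ratios, for illustration only.
error = {"root": 0.2, "a": 0.2, "a1": 0.3, "a2": 0.3, "b": 0.1}
support = {"root": 1.0, "a": 0.5, "a1": 0.2, "a2": 0.2, "b": 0.3}
tree = Node("root", [Node("a", [Node("a1"), Node("a2")]), Node("b")])
print(prune(tree, lambda n: error[n.name], lambda n: support[n.name]))
```

In this toy tree the summed gain of the children of `a` (1.0) exceeds the gain of `a` itself (0.7), so `a` is pruned, while the root is kept.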



4 Experiments and Implementation 



In the experiment, a questionnaire was designed to efficiently collect the users'
opinions for analysis. In our experiment, 700 users were asked to answer the
questionnaire, and 616 of the completed questionnaires are usable in the clustering
without missing data. The rating vectors of the questionnaires of the same category
are clustered by the k-means algorithm. Then the decision tree algorithm and the
corresponding pruning algorithm are applied to find rules about the attributes of the
users in each cluster, and the rules are generated from the constructed decision tree
model. The results of the experiment are shown in the following tables:



Table 1. Experiment results of both training set and test set

           Number of   Training Set                     Test Set
           Rules       Avg. Precision   Avg. Support    Avg. Precision   Avg. Support
1st set    11          64.15 %          39.18/431       62.13 %          16.82/185
2nd set    8           58.79 %          53.88/431       54.75 %          23.13/185
3rd set    10          62.56 %          43.1/431        57.56 %          18.5/185
4th set    15          65.47 %          28.73/431       60.4 %           12.33/185

5 Conclusion



In this paper, users' rating opinions are represented as Rating Vectors using the
rating data collected from the Collaborative Rating System, and are clustered into
several Rating Groups which correspond to different rating tendencies. A decision
tree algorithm is then used to find common characteristics of the participants of the
Rating Groups. In addition, a Precision and Support Based Decision Pruning Algorithm
is applied to prune tree branches by considering both precision and support. Finally,
the rules generated from the pruned decision tree are used to provide users with
characterized rating services and to construct a Characterized Rating Recommend
System. In the experiment, a questionnaire was designed to collect users' rating
opinions efficiently. Experimental results showed that the proposed system can
provide acceptable precision and an acceptable rating recommendation service.



Table 2. Experiment results before and after pruning

           Before Pruning                   After Pruning
           Avg. Precision   No. of rules    Avg. Precision   No. of rules
1st set    85.16 %          15              64.15 %          11
2nd set    82.75 %          13              58.79 %          8
3rd set    88.13 %          17              62.56 %          10
4th set    84.27 %          23              65.47 %          15
Abstract. Many documents such as Web documents or XML files have
no rigid structure. Such semistructured documents have been rapidly
increasing. We propose a new method for discovering frequent tree
structured patterns in semistructured Web documents. We consider the data
mining problem of finding all maximally frequent tag tree patterns in
semistructured data such as Web documents. A tag tree pattern is an
edge labeled tree which has hyperedges as variables. An edge label is a
tag or a keyword in Web documents, and a variable can be substituted by
any tree. So a tag tree pattern is suited for representing tree structured
patterns in semistructured Web documents. We present an algorithm for
finding all maximally frequent tag tree patterns. We also report some
experimental results on XML documents obtained by using our algorithm.



1 Introduction 

Web documents have been rapidly increasing as Information Technologies
develop. Our target for knowledge discovery is Web documents which have
tree structures, such as documents on the World Wide Web or XML/SGML files.
Such Web documents are called semistructured data. In order to extract
meaningful and hidden knowledge from semistructured Web documents, we need
first to discover frequent tree structured patterns from them.

In this paper, we adopt a variant of the Object Exchange Model (OEM, for short)
for representing semistructured data. For example, we give an XML file
xml_sample and a labeled tree T as its OEM data in Fig. 1. Many real
semistructured data have no absolute schema fixed in advance, and their structures
may be irregular or incomplete. As knowledge representations for semistructured data,
for example, the type of objects, tree-expression patterns and regular path
expressions were proposed. In earlier work, we presented the concept of term trees as
graph patterns suited for representing tree-like semistructured data. A term tree
is a pattern consisting of variables and tree-like structures. A term tree is different
from the other representations in that a term tree has structured
variables which can be substituted by arbitrary trees.

    <Fruits>
      <Name> watermelon </Name>
      <Shape> sphere </Shape>
      <Shape> large </Shape>
    </Fruits>

    xml_sample

[Figure: the labeled tree T, with edge labels such as <Fruits>, <Name>, <Shape>,
watermelon, and large.]

Fig. 1. An XML file xml_sample and a labeled tree T as its OEM data.



In earlier work, we gave the knowledge discovery system KD-FGS, which receives graph
structured data and produces a hypothesis by using Formal Graph System as
a knowledge representation language. We also designed an efficient knowledge
discovery system having polynomial time matching algorithms and a polynomial
time inductive inference algorithm from tree-like semistructured data. The above
systems find a hypothesis consistent with all input data or a term tree which
can explain a minimal language containing all input data, respectively. These
systems work correctly and effectively for complete data. However, for irregular
or incomplete data, the systems may output obvious or meaningless knowledge.
In this paper, in order to obtain knowledge efficiently from irregular or
incomplete semistructured data, we define a tag tree pattern, which is a special type
of term tree. In Fig. 2, for example, the tag tree pattern p matches the OEM
data o1 and o2, but p does not match the OEM data o3.

The purpose of this work is the extraction of tree structured patterns from tree
structured data which are regarded as positive samples. So, overgeneralized
patterns explaining the given data are meaningless. Finding least generalized
patterns explaining the given data, which are called maximally frequent tag tree
patterns, is reasonable. We propose a method for discovering all maximally
frequent tag tree patterns. To do this, we present an algorithm which generates all
maximally frequent tag tree patterns by employing the algorithm in [5], which
generates the canonical representations of all rooted trees. By using this
algorithm, we can exclude meaningless tag tree patterns and avoid missing useful
tag tree patterns. We also report some experimental results on XML documents
obtained by using our algorithm.



2 Tag Tree Patterns and Data Mining Problems 

Let $T = (V_T, E_T)$ be a rooted unordered tree (or simply tree) with an edge
labeling. A variable in $V_T$ is a list $[u, u']$ of two distinct vertices $u$ and $u'$ in $V_T$.



Semistructured Web Documents 



49 



[Figure: the tag tree pattern p and the OEM data o1, o2 and o3, with edge labels such as
<Fruits>, <Name>, <Shape>, watermelon, sphere, large, melon, green, strawberry,
raspberry and blueberry.]

Fig. 2. A tag tree pattern p which matches OEM data o1 and o2 but does not match
OEM data o3.



A label of a variable is called a variable label. $\Lambda$ and $X$ denote a set of edge
labels and a set of variable labels, respectively, where $\Lambda \cap X = \emptyset$. A triplet
$g = (V_g, E_g, H_g)$ is called a rooted term tree (or simply term tree) if $(V_g, E_g)$ is
a tree, $H_g$ is a finite set of variables, and the graph $(V_g, E_g \cup E'_g)$ is a tree
where $E'_g = \{\{u, v\} \mid [u, v] \in H_g\}$. A term tree $g$ is called regular if all variables
have mutually distinct variable labels in $X$. Let $f$ and $g$ be term trees with at
least two vertices. Let $\sigma = [u, u']$ be a list of two distinct vertices in $g$. The
form $x := [g, \sigma]$ is called a binding for $x$. A new term tree $f\{x := [g, \sigma]\}$ is
obtained by applying the binding $x := [g, \sigma]$ to $f$ in the following way: Let
$e_1 = [v_1, v'_1], \ldots, e_m = [v_m, v'_m]$ be the variables in $f$ with the variable label $x$.
Let $g_1, \ldots, g_m$ be $m$ copies of $g$ and $u_i, u'_i$ the vertices of $g_i$ corresponding to $u, u'$
of $g$. For each variable $e_i$, we attach $g_i$ to $f$ by removing the variable $e_i = [v_i, v'_i]$
from $H_f$ and by identifying the vertices $v_i, v'_i$ with the vertices $u_i, u'_i$ of $g_i$. Let
the root of the resulting term tree be the root of $f$. A substitution $\theta$ is a finite
collection of bindings $\{x_1 := [g_1, \sigma_1], \ldots, x_n := [g_n, \sigma_n]\}$, where the $x_i$'s are mutually
distinct variable labels in $X$.

Tag Tree Patterns. Let $\Lambda_{Tag}$ and $\Lambda_{KW}$ be two languages which contain
infinitely many words where $\Lambda_{Tag} \cap \Lambda_{KW} = \emptyset$. We call words in $\Lambda_{Tag}$ and $\Lambda_{KW}$
a tag and a keyword, respectively. A tag tree pattern is a regular term tree such
that each edge label on it is any of a tag, a keyword, and a special symbol "?". A
tag tree pattern with no variable is called a ground tag tree pattern. For an edge
$\{v, v'\}$ of a tag tree pattern and an edge $\{u, u'\}$ of a tree, we say that $\{v, v'\}$
matches $\{u, u'\}$ if the following conditions (1)-(3) hold: (1) If the edge label of
$\{v, v'\}$ is a tag then the edge label of $\{u, u'\}$ is the same tag or a tag which
is considered to be identical under an equality relation on tags. (2) If the edge
label of $\{v, v'\}$ is a keyword then the edge label of $\{u, u'\}$ is a keyword and the
label of $\{v, v'\}$ appears as a substring in the edge label of $\{u, u'\}$. (3) If the edge
label of $\{v, v'\}$ is "?" then we do not care about the edge label of $\{u, u'\}$.
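Conditions (1)-(3) can be sketched as a small predicate. Representing an edge label as a (kind, text) pair and passing the equality relation on tags as a function are assumptions made for this illustration.

```python
def edge_matches(pattern_label, tree_label, same_tag=lambda a, b: a == b):
    """Check conditions (1)-(3) for one pattern edge and one tree edge.

    Each label is a (kind, text) pair with kind in {"tag", "keyword", "?"}
    (an assumed representation). same_tag is the equality relation on tags.
    """
    p_kind, p_text = pattern_label
    t_kind, t_text = tree_label
    if p_kind == "?":                       # condition (3): any label is fine
        return True
    if p_kind == "tag":                     # condition (1): identical tags
        return t_kind == "tag" and same_tag(p_text, t_text)
    if p_kind == "keyword":                 # condition (2): substring test
        return t_kind == "keyword" and p_text in t_text
    raise ValueError(f"unknown label kind: {p_kind!r}")


def ignore_case(a, b):
    """One possible equality relation on tags: compare case-insensitively."""
    return a.lower() == b.lower()


print(edge_matches(("keyword", "melon"), ("keyword", "watermelon")))
print(edge_matches(("tag", "Fruits"), ("tag", "fruits"), same_tag=ignore_case))
```

The keyword "melon" matches "watermelon" because it occurs as a substring, while a tag never matches a keyword regardless of its text.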

A ground tag tree pattern $\pi = (V_\pi, E_\pi, \emptyset)$ matches a tree $T = (V_T, E_T)$ if
there exists a bijection $\varphi$ from $V_\pi$ to $V_T$ such that (i) the root of $\pi$ is mapped to
the root of $T$ by $\varphi$, (ii) $\{v, v'\} \in E_\pi$ if and only if $\{\varphi(v), \varphi(v')\} \in E_T$, and (iii)
for all $\{v, v'\} \in E_\pi$, $\{v, v'\}$ matches $\{\varphi(v), \varphi(v')\}$. A tag tree pattern $\pi$ matches
a tree $T$ if there exists a substitution $\theta$ such that $\pi\theta$ is a ground tag tree pattern
and $\pi\theta$ matches $T$. Then the language $L(\pi)$, which is the descriptive power of a tag
tree pattern $\pi$, is defined as $L(\pi) = \{\text{a tree } T \mid \pi \text{ matches } T\}$.

Data Mining Problems. A set of semistructured data $\mathcal{D} = \{T_1, T_2, \ldots, T_m\}$ is
a set of trees. The matching count of a given tag tree pattern $\pi$ w.r.t. $\mathcal{D}$, denoted
by $match_{\mathcal{D}}(\pi)$, is the number of trees $T_i \in \mathcal{D}$ ($1 \le i \le m$) such that $\pi$ matches
$T_i$. Then the frequency of $\pi$ w.r.t. $\mathcal{D}$ is defined by $supp_{\mathcal{D}}(\pi) = match_{\mathcal{D}}(\pi)/m$. Let
$\sigma$ be a real number where $0 < \sigma \le 1$. A tag tree pattern $\pi$ is $\sigma$-frequent w.r.t. $\mathcal{D}$
if $supp_{\mathcal{D}}(\pi) \ge \sigma$. We denote by $\Pi(L)$ the set of all tag tree patterns such that all
edge labels are in $L$. Let $Tag$ be a finite subset of $\Lambda_{Tag}$ and $KW$ a finite subset
of $\Lambda_{KW}$. A tag tree pattern $\pi \in \Pi(Tag \cup KW \cup \{?\})$ is maximally $\sigma$-frequent
w.r.t. $\mathcal{D}$ if (1) $\pi$ is $\sigma$-frequent, and (2) if $L(\pi') \subsetneq L(\pi)$ then $\pi'$ is not $\sigma$-frequent
for any tag tree pattern $\pi' \in \Pi(Tag \cup KW \cup \{?\})$. In Fig. 2, for example, we give
a maximally $\frac{2}{3}$-frequent tag tree pattern p in $\Pi(\{\langle Fruits \rangle, \langle Name \rangle, \langle Shape \rangle\} \cup
\{melon\} \cup \{?\})$ with respect to OEM data o1, o2 and o3. The tag tree pattern p
matches o1 and o2, but p does not match o3.
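The frequency of a pattern can be sketched as follows. The matching test itself is passed in as a function, and the stand-in matcher in the example simply encodes the statement that p matches o1 and o2 but not o3; it is a placeholder for illustration, not the matching algorithm.

```python
from fractions import Fraction


def support(pattern, data, matches):
    """Frequency of a pattern: number of matched trees divided by m."""
    matched = sum(1 for tree in data if matches(pattern, tree))
    return Fraction(matched, len(data))


def is_frequent(pattern, data, matches, sigma):
    """A pattern is sigma-frequent when its frequency is at least sigma."""
    return support(pattern, data, matches) >= sigma


def stand_in_matcher(pattern, tree):
    """Placeholder: encodes that p matches o1 and o2 but not o3."""
    return tree in {"o1", "o2"}


data = ["o1", "o2", "o3"]
print(support("p", data, stand_in_matcher))
print(is_frequent("p", data, stand_in_matcher, Fraction(2, 3)))
```

With three trees and two matches, the frequency of p is 2/3, so p is σ-frequent for any threshold up to 2/3 and not for larger ones.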

All Maximally Frequent Tag Tree Patterns Problem

Input: A set of semistructured data $\mathcal{D}$, a threshold $0 < \sigma \le 1$, and finite sets
of edge labels $Tag$ and $KW$.

Problem: Generate all maximally $\sigma$-frequent tag tree patterns w.r.t. $\mathcal{D}$ in
$\Pi(Tag \cup KW \cup \{?\})$.

In earlier work, we gave a polynomial time algorithm for finding one of the maximally
$\sigma$-frequent tag tree patterns. Here we propose an algorithm for generating all maximally
$\sigma$-frequent tag tree patterns with at most $n$ vertices by generating all canonical
level sequences of trees with $n$ vertices [5]. Let $T$ be a tree with $n$ vertices. A
level sequence $\ell(T) = [\ell_1 \ell_2 \cdots \ell_n]$ is obtained by traversing $T$ in preorder, and
recording the level (= depth + 1) of each vertex as it is visited. The canonical level
sequence of $T$, denoted by $\ell(T)^*$, is the lexicographically last level sequence of
$T$. In order to prune the hypothesis space, we define a function $r_p$ for reducing a
given level sequence $\ell(T)$ with $n$ elements to a level sequence with $n - 1$ elements
as follows: Let $q'$ be the leftmost position following $p$ such that $\ell_{q'} \le \ell_p$. If there
is no such position, let $q'$ be $n + 1$ for convenience. If the $p$th vertex has only
one child, we define $r_p(\ell(T))$ to be $[\ell_1 \cdots \ell_{p-1}\ \ell_{p+1} - 1 \cdots \ell_{q'-1} - 1\ \ell_{q'} \cdots \ell_n]$,
if the $p$th vertex is a leaf such that $\ell_p \ge \ell_{p+1}$ or $p = n$, then $r_p(\ell(T)) =
[\ell_1 \cdots \ell_{p-1}\ \ell_{p+1} \cdots \ell_n]$, otherwise $r_p(\ell(T))$ is undefined.
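The level sequence, its canonical form, and the reduction function can be sketched as below. Trees are represented as nested lists of child subtrees, which is an assumed encoding; the canonical form is obtained by ordering child subtrees so that their sequences are in descending lexicographic order, a standard way to get the lexicographically last sequence.

```python
def level_sequence(tree, level=1):
    """Preorder levels (depth + 1); a tree is a nested list of child subtrees."""
    seq = [level]
    for child in tree:
        seq.extend(level_sequence(child, level + 1))
    return seq


def canonical_level_sequence(tree, level=1):
    """Lexicographically last level sequence over all orderings of children."""
    child_seqs = sorted(
        (canonical_level_sequence(child, level + 1) for child in tree),
        reverse=True,
    )
    seq = [level]
    for child_seq in child_seqs:
        seq.extend(child_seq)
    return seq


def reduce_sequence(seq, p):
    """Reduction at 1-based position p; returns None where it is undefined."""
    n = len(seq)
    # q: leftmost position after p whose level is <= the level at p, else n + 1.
    q = next((i for i in range(p + 1, n + 1) if seq[i - 1] <= seq[p - 1]), n + 1)
    children = [i for i in range(p + 1, q) if seq[i - 1] == seq[p - 1] + 1]
    if len(children) == 1:  # remove p and lift its subtree by one level
        return seq[:p - 1] + [lvl - 1 for lvl in seq[p:q - 1]] + seq[q - 1:]
    if not children:        # p is a leaf: simply remove it
        return seq[:p - 1] + seq[p:]
    return None


tree = [[], [[]]]  # root with a leaf child and a child that has one child
print(level_sequence(tree), canonical_level_sequence(tree))
print(reduce_sequence([1, 2, 3, 2], 2))
```

For this small tree the plain preorder sequence is [1, 2, 2, 3], while the canonical one is [1, 2, 3, 2]; reducing the canonical sequence at position 2 removes the vertex with a single child and yields [1, 2, 2].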

Given a set of semistructured data $\mathcal{D}$, let $n$ be the maximum number of
vertices of trees in $\mathcal{D}$. We repeat the following three steps for $k = 1, \ldots, n$: Let
$\Pi_k^\sigma$ be the set of all $\sigma$-frequent tag tree patterns with at most $k$ vertices and no
edge. Let $\Pi_k^\sigma(L)$ be the set of all $\sigma$-frequent tag tree patterns with at most $k$
vertices and edge labels in $L$.

Step 1. Generate the canonical level sequences of all tag tree patterns with $k$
vertices. For each canonical level sequence of length $k$, we determine whether or
not $(r_p(\ell(\pi)))^*$ is in $\Pi_{k-1}^\sigma$ for each $p = 2, \ldots, k$. If there is $p$ such that $(r_p(\ell(\pi)))^*$

Experiment (frequency σ)     Exp. 1 (σ=0.3)                    Exp. 2 (σ=0.5)
max # of vertices in TTPs    2    3    4    5    6     7       2    3    4    5    6    7
# of max freq TTPs           1    2    4    9    15    34      1    2    2    4    3    5
run time (secs)              7    32   159  630  1948  6162    9    39   107  312  627  2721

Fig. 3. Experimental results for generating all maximally frequent tag tree patterns
and maximally σ-frequent tag tree patterns obtained in the experiments.



is not in $\Pi_{k-1}^\sigma$, $\pi$ is not $\sigma$-frequent, otherwise we compute the frequency of $\pi$
and if the frequency is greater than or equal to $\sigma$ then we add $\pi$ to $\Pi_k^\sigma$.

Step 2. For each $\pi \in \Pi_k^\sigma$, we try to substitute as many variables of $\pi$ as possible
with edges labeled with "?" so that all $\sigma$-frequent tag tree patterns in
$\Pi_k^\sigma(\{?\})$ are generated. This work can be done in a backtracking way. Then for
each $\pi \in \Pi_k^\sigma(\{?\})$, we try to replace as many ?'s as possible with labels in
$Tag \cup KW$ so that all $\sigma$-frequent tag tree patterns in $\Pi_k^\sigma(Tag \cup KW \cup \{?\})$ are
generated. This work can also be done in a backtracking way.

Step 3. Finally we check whether or not $\pi \in \Pi_k^\sigma(Tag \cup KW \cup \{?\})$ is maximally
$\sigma$-frequent. Let $g$ be a tag tree pattern $(\{u_1, u_2, u_3\}, \emptyset, \{[u_1, u_2], [u_2, u_3]\})$. Let
$\theta_x^1 = \{x := [g, [u_1, u_2]]\}$, $\theta_x^2 = \{x := [g, [u_2, u_3]]\}$, and $\theta_x^3 = \{x := [g, [u_1, u_3]]\}$,
for each variable $x$ appearing in $\pi$. If there exists a variable labeled with $x$ such
that $\pi\theta_x^i$ is $\sigma$-frequent, then $\pi$ is not maximal, otherwise we can conclude that
$\pi$ is maximally $\sigma$-frequent.



3 Implementation and Experimental Results 

We have implemented the algorithm for generating all maximally frequent tag
tree patterns on a SUN workstation Ultra-10 with a 333 MHz clock. We report
some experiments on a sample file of semistructured data. The sample file is
converted from a sample XML file about garment sales data, similar to xml_sample
in Fig. 1. The sample file consists of 32 tree structured data. The maximum
number of vertices in a tree in the file is 58, the maximum depth is 3 and the
maximum number of children of a vertex is 6. In the experiments described in
Fig. 3, we gave the algorithm "<Quarter>" and "<Description>" as tags, and
"Summer" and "Shirt" as keywords. The algorithm generated all maximally
σ-frequent tag tree patterns w.r.t. the sample file for a specified minimum frequency
σ. We can set the maximum number ("max # of vertices in TTPs") of vertices
of tag tree patterns in the hypothesis space.

We explain the results of Fig. 3 by taking the last column of Experiment
Exp. 1 as an example. The specified minimum frequency σ is 0.3. The total
number ("# of max freq TTPs") of all maximally σ-frequent tag tree patterns
with at most 7 vertices is 34. The total run time is 6162 secs. One of these
maximally frequent patterns is shown in Fig. 3.

4 Conclusions 

In this paper, we have considered knowledge discovery from semistructured Web
documents such as XML files. We have proposed a tag tree pattern, which is
suited for representing tree structured patterns in such semistructured data. We
have given an algorithm for solving the All Maximally Frequent Tag Tree Patterns
Problem. We have reported some experimental results obtained by applying our
algorithm to a sample file of semistructured Web documents.
Abstract. Text categorization presents unique challenges due to
the large number of attributes present in the data set, the large number
of training samples, attribute dependency, and the multi-modality of
categories. Existing classification techniques have limited applicability
to data sets of this nature. In this paper, we present a Weight
Adjusted k-Nearest Neighbor (WAKNN) classification that learns feature
weights based on a greedy hill climbing technique. We also present two
performance optimizations of WAKNN that improve the computational
performance by a few orders of magnitude, but do not compromise
on the classification quality. We experimentally evaluated WAKNN
on 52 document data sets from a variety of domains and compared
its performance against several classification algorithms, such as C4.5,
RIPPER, Naive-Bayesian, PEBLS and VSM. Experimental results on
these data sets confirm that WAKNN consistently outperforms other
existing classification algorithms.

Keywords: text categorization, k-NN classification, weight adjustments



1 Introduction 



We have seen tremendous growth in the volume of online text documents
available on the Internet, digital libraries, news sources, and company-wide intranets.
Automatic text categorization, which is the task of assigning text
documents to prespecified classes (topics or themes) of documents, is an important
task that can help people find information in these huge resources.

Text categorization presents unique challenges due to the large number
of attributes present in the data set, the large number of training samples,
attribute dependency, and the multi-modality of categories. Existing classification
algorithms address these challenges to varying degrees.
k-nearest neighbor (k-NN) classification is an instance-based learning
algorithm that has been shown to be very effective for a variety of problem domains in
which the underlying densities are not known. In particular, this classification
paradigm works well on data sets with multi-modality. It has been applied
to text categorization since the early days of research, and has been shown
to produce better results when compared against other machine learning
algorithms such as C4.5 and RIPPER. In the text categorization task, the class
of a new text document is determined by computing the similarity between the
test document and individual instances of the training documents, and
determining the class based on the class distribution of the nearest instances. A major
drawback of this algorithm is that it uses all the features while computing the
similarity between a test document and the training documents. In many text data
sets, a relatively small number of features (or words) may be useful in
categorizing documents, and using all the features may affect performance. A possible
approach to overcome this problem is to learn weights for different features (or
words).

In this paper, we present a Weight Adjusted k-Nearest Neighbor (WAKNN)
classification that learns weights for words. WAKNN searches for the best weight
vector using an optimization function based on leave-one-out cross validation
and a greedy hill climbing technique. WAKNN has better classification results
than many other classifiers, but it has a high computational cost. One of the
key challenges of the weight adjustment algorithm is how to reduce this high
computational cost. We present two performance optimizations of WAKNN in
this paper. The first optimization intelligently selects the words used for
weight adjustment. The second optimization reduces the computational cost
involved in the evaluation of weight adjustments by clustering documents within
each class. Experimental results show that these two optimizations do not
compromise the quality of the classification, but improve the computational
performance by a few orders of magnitude. We experimentally evaluated WAKNN-C,
which is the performance-improved version of WAKNN, on 52 document data sets
from a variety of domains and compared its performance against several
classification algorithms, such as C4.5, RIPPER, Naive-Bayesian, PEBLS, and
Variable-Kernel Simulation Metric (VSM). Experimental results on these
data sets show that WAKNN-C consistently outperforms other existing
classification algorithms.

Section 2 describes challenges of text categorization and provides a brief
overview of existing algorithms for text categorization. Section 3 presents the
weight adjustment algorithm WAKNN and its performance improvements.
Section 4 shows the experimental evaluation of WAKNN. Finally, Section 5 provides
conclusions and directions for future research.



2 Previous Work 

Text categorization is essentially a classification problem. The words
occurring in the document sets become variables or features for the
classification problem. The class profile or model is described based on these
words. There are several classification schemes that can potentially be used for
text categorization.

A classification decision tree, such as C4.5, is a widely used classification
paradigm that has been shown to produce good classification results, primarily
on low dimensional data sets. Decision tree based schemes like C4.5 or rule
induction algorithms such as C4.5rules and RIPPER are not very effective
on text data sets due to overfitting [7]. The overfitting occurs because the
number of samples is relatively small with respect to the number of
distinguishing words, which leads to very large trees with limited
generalization ability.

The Naive-Bayesian (NB) classification algorithm has been widely used for
document classification, and has been shown to produce very good performance.
Even though Naive-Bayes classification techniques, such as Rainbow,
are commonly used in text categorization, the independence assumption
limits their applicability to document classes, and unimodal density
assumptions might not work well in document data sets with multi-modal densities.

There have been several approaches to learning feature weights for k-NN.
PEBLS and k-NN with mutual information are approaches that compute the
weights of features prior to the classification learning. These approaches
compute the importance of a feature independently of all the other features.
Variable-Kernel Simulation Metric (VSM) learns the feature weights using
non-linear conjugate gradient optimization. VSM has a very structured approach
to finding weights, but requires the optimization function to be differentiable
and does not have the convergence guarantees of linear conjugate gradient
optimization. RELIEF-F [12] is another weight adjustment technique that learns
weights based on the nearest neighbors in each class.

Support Vector Machines (SVM) is a new learning algorithm that was
introduced to solve the two-class pattern recognition problem using the
Structural Risk Minimization principle. An efficient implementation of SVM and
its application to text categorization of the Reuters-21578 corpus have also
been reported.



3 Weight Adjusted k-Nearest Neighbor Classification
(WAKNN)

The Weight Adjusted k-Nearest Neighbor Classification (WAKNN) tries to learn
the best weight vector for classification. Given the weight vector W, the
similarity between two documents X and Y is measured using the weighted cosine
measure:

$$\cos(X, Y, W) = \frac{\sum_{t \in T} (X_t \times W_t) \times (Y_t \times W_t)}{\sqrt{\sum_{t \in T} (X_t \times W_t)^2} \times \sqrt{\sum_{t \in T} (Y_t \times W_t)^2}},$$

where T is the set of terms (or words), $X_t$ and $Y_t$ are the normalized text
frequencies (TF) of word t for X and Y, respectively, and $W_t$ is the weight of
word t.
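
As an illustration of the weighted cosine measure above, here is a minimal
Python sketch. It assumes documents are stored as sparse dictionaries mapping a
term to its normalized frequency; the function name `weighted_cosine`, the
default weight of 1.0 for unlisted terms, and the convention of returning 0.0
for an empty document are choices made for this sketch, not details from the
paper.

```python
import math


def weighted_cosine(x, y, w):
    """Weighted cosine similarity between two sparse documents.

    x, y: dict mapping term -> normalized term frequency.
    w:    dict mapping term -> weight; terms missing from w get weight 1.0
          (an assumption of this sketch).
    """
    def scaled(doc):
        # Multiply every term frequency by the weight of its term.
        return {t: f * w.get(t, 1.0) for t, f in doc.items()}

    xs, ys = scaled(x), scaled(y)
    # Only terms present in both documents contribute to the numerator.
    dot = sum(v * ys[t] for t, v in xs.items() if t in ys)
    norm_x = math.sqrt(sum(v * v for v in xs.values()))
    norm_y = math.sqrt(sum(v * v for v in ys.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for empty or fully zeroed documents
    return dot / (norm_x * norm_y)
```

Raising the weight of a shared term pulls two documents together, while setting
the weights of unshared terms to zero makes them identical under this measure.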

Figure 1 illustrates the objective of weight adjustment in the classification
task. In the original data without weights, the test sample of class A is equally
close to the training samples in class A and class B. After weight adjustment, it
is much closer to the training samples in class A and thus is correctly classified.

We can visually see the effect of weight adjustment on real data sets using the
class-preserving projection. This projection tries to find the optimal
2-dimensional display of high-dimensional data sets with class labels. This
projection works best when there are 3 classes in the data sets, because 3 class
means automatically determine a plane for 2-dimensional display. Hence we
selected 3 classes from the training data set of west1, which is described in
Section 4, and learned weights using WAKNN. Figure 2 shows the original data set
on the left and the weight-adjusted data set on the right. Figure 2 shows that
WAKNN was able to find weights that can separate data points in different
classes. For instance, data points in class “010” (depicted as “x” in the
figure) are almost completely separated from the other two classes. Classes
“008” (depicted as “+”) and “054” (depicted as “*”) are also well separated
except for several points mixed up around the coordinate (-0.1, -0.2).



Fig. 1. Weight Adjustment in k-NN Classification: (a) before weight adjustment;
(b) after weight adjustment.

Fig. 2. 2-dimensional display of west1 data set before and after weight adjustments.



The weight adjustment process is a search process in the word weight vector 
space. The key design decisions in this process include which optimization func- 
tion to use, which search method to use, and how to make the algorithm efficient 
in terms of run time and memory requirements. In the rest of this section, we 
will discuss these design decisions and present the algorithm.

3.1 Optimization Function and Search Strategy

A natural choice for the optimization function in the k-NN classification
paradigm is a function based on leave-one-out cross validation. The overall
value of the optimization function is computed from the contribution of each
training sample used in the cross validation. Each training sample finds its
k-nearest neighbors using the similarity function with the current weight
vector. The contribution of each training sample to the optimization function
depends on the class labels of these k-nearest neighbors and the similarities to
these neighbors. A training sample should contribute to the overall optimization
function only if it is correctly classified based on its k-nearest neighbors and
its classification is clear cut. One way of achieving this goal is to use the
following contribution function. Let D be a set of training documents and W be a
weight vector. Let C be the set of classes of D and class(x) be a function that
returns the class of document x. For a training sample $d \in D$, let
$N_d = \{n_1, n_2, \ldots, n_k\}$ be the set of k-nearest neighbors of d. Given
$N_d$, the similarity sum of d's neighbors that belong to class c is:

$$S_c = \sum_{n_i \in N_d,\ \mathit{class}(n_i) = c} \cos(d, n_i, W) \qquad (1)$$

Then the total similarity sum of d is:

$$T = \sum_{c \in C} S_c$$

The contribution of d, Contribution(d), is defined in terms of $S_c$ of classes
$c \in C$ and T:

$$\mathit{Contribution}(d) = \begin{cases} 1 & \text{if } \forall c \in C,\ c \neq \mathit{class}(d),\ S_{\mathit{class}(d)} > S_c \text{ and } \frac{S_{\mathit{class}(d)}}{T} \geq p \\ 0 & \text{otherwise} \end{cases}$$

In this contribution function, a document d contributes to the overall
optimization function only if the sum of similarities of d's neighbors with
class class(d) is greater than that of any other class and is at least a
fraction p of the total similarity sum. Finally, the optimization function is
defined as:

$$\mathit{OPT}(D, W) = \frac{\sum_{d \in D} \mathit{Contribution}(d)}{|D|}$$

Different Contribution(d) functions describe different optimization functions.
We have experimented with several contribution functions and chose the
above contribution function for WAKNN. The optimization function based on
this contribution function allows the weight adjustment to be flexible in terms
of local minima and overfitting. If the optimization function uses a low p (close
to 0.0), then the weight adjustment will tend to overfit the training documents
but will move out of possible local minima more easily. On the other hand, if the
optimization function uses a high p (close to 1.0), then the weight adjustment
will tend to avoid overfitting but will tend to be trapped in local minima.

Several search techniques are available, including hill-climbing (greedy)
search, best-first search, beam search, and bidirectional search.
WAKNN searches for the best weight vector using the hill-climbing (greedy)
method, because this approach is computationally fast and memory efficient. In
the greedy search of WAKNN, the weight for each word is perturbed slightly and
its effect on the optimization function is evaluated. The perturbation of the
weight is proposed by multiplying the current weight by different factors. The
weight perturbation with the most improvement in the optimization function is
chosen as the winner and its weight is updated. This process is repeated until
no further progress is obtained.
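
The contribution function and the leave-one-out optimization function defined
in this section can be sketched in Python as follows. This is a minimal sketch
under stated assumptions: the pairwise similarities are taken as a precomputed
n x n matrix (in WAKNN they would come from the weighted cosine measure under
the current weight vector), and the names `contribution` and `opt` are
hypothetical, not taken from the authors' implementation.

```python
from collections import defaultdict


def contribution(d_label, neighbors, p):
    """Contribution of one training sample.

    neighbors: list of (class_label, similarity) pairs for the k nearest
               neighbors of the sample; d_label is the sample's own class.
    Returns 1 if the sample's class has the strictly largest similarity sum
    and holds at least a fraction p of the total similarity sum, else 0.
    """
    sums = defaultdict(float)
    for label, sim in neighbors:
        sums[label] += sim          # S_c for every class c seen among neighbors
    total = sum(sums.values())      # T
    if total <= 0.0:
        return 0
    own = sums.get(d_label, 0.0)    # S_class(d)
    beats_all = all(own > s for label, s in sums.items() if label != d_label)
    return 1 if beats_all and own / total >= p else 0


def opt(labels, sim, k, p):
    """Leave-one-out objective: fraction of samples with contribution 1.

    labels: class label of each training sample.
    sim:    n x n similarity matrix (list of lists), assumed precomputed.
    """
    n = len(labels)
    score = 0
    for i in range(n):
        # Leave sample i out and rank the remaining samples by similarity.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        neighbors = [(labels[j], sim[i][j]) for j in others[:k]]
        score += contribution(labels[i], neighbors, p)
    return score / n
```

Because each evaluation ranks all other samples for every sample, one call to
`opt` performs on the order of n^2 similarity lookups, which is the cost that
the optimizations of Section 3.2 target.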



3.2 Performance Improvements of WAKNN 

The major drawback of WAKNN is its high computational cost. After each
possible weight adjustment of a word in the greedy search, the merit of the
possible weight adjustment is evaluated using the optimization function
OPT(D, W). The cost of this evaluation is $O(n^2)$, where n is the number of
training documents, because finding the k nearest neighbors of each training
document requires comparison to the rest of the training documents. Hence the
computational cost of each iteration of the WAKNN algorithm is $O(mn^2)$, where
m is the number of words in the document data sets. We propose two
computational performance improvements of WAKNN. The first method (WAKNN-F)
improves the performance by reducing the number of words used for the weight
adjustment, i.e., reducing m in $O(mn^2)$. The second method (WAKNN-C) improves
the performance by reducing the cost of evaluating OPT(D, W), i.e., reducing n
in $O(mn^2)$.



WAKNN-F. In order to select words for weight adjustment, we need a ranking
method and a selection method. We use mutual information to rank words. The
mutual information can be computed in $O(nl)$, where n is the number of training
documents and l is the average length of a document. In a single scan of the
training documents, we can compute the class distribution of each word. Based on
this information, the mutual information of each word can be computed. Our
preliminary experiments show that only 10 to 70 weights are changed under WAKNN
and that most of these words are ranked in the top 1000 according to the mutual
information of words. These results show that the ranking method based on mutual
information could be effective in finding words that are important in weight
adjustment. We also consider pairwise word dependency in the ranking by
computing the mutual information of pairs of words. The mutual information of a
word is then determined as the maximum of the mutual information of the single
word and of any pair of words that contains this word.
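
The single-scan ranking described above can be sketched as follows. The paper
does not spell out its mutual information formula at this point, so this sketch
assumes the standard definition between a binary word-presence variable and the
class label; the function name `word_mutual_information` is hypothetical, and
the pairwise extension is omitted.

```python
import math
from collections import Counter, defaultdict


def word_mutual_information(docs, labels):
    """Mutual information (in bits) between word presence and class label.

    docs:   list of iterables of words, one per training document.
    labels: parallel list of class labels.
    """
    n = len(docs)
    class_count = Counter(labels)
    word_class = defaultdict(Counter)  # word -> class -> #documents containing it
    for words, label in zip(docs, labels):  # single scan over the documents
        for t in set(words):
            word_class[t][label] += 1

    mi = {}
    for t, per_class in word_class.items():
        n_t = sum(per_class.values())  # documents that contain the word
        total = 0.0
        for c, n_c in class_count.items():
            cells = ((per_class[c], n_t),              # word present, class c
                     (n_c - per_class[c], n - n_t))    # word absent, class c
            for joint, marginal in cells:
                if joint > 0:
                    total += (joint / n) * math.log2(joint * n / (marginal * n_c))
        mi[t] = total
    return mi
```

A word that appears in every document of one class and in no other document
scores highest, while a word that appears everywhere scores zero.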

There are three desired goals of the selection method. We want to include as
many top ranked words as possible. At the same time, we want to make sure that
the selected words cover as many documents as possible. It is possible that a
small set of documents contains many words that have high mutual information.
If we select words based purely on the mutual information, we will not be able
to improve the classification accuracies of documents not in this set. Finally, we
want to select as few words as necessary to control the computational cost.

We propose the following scheme to achieve these three goals. The overall
steps of the scheme are shown in Figure 3. We rank words globally according to
the mutual information. For each document, we sort its words according to the
global ranking. We have chosen an incremental approach to determine the right
level of coverage of documents. We start by collecting the top 10%
(MinCov = 0.1) of the words according to the global ranking in each document.
The union of all these words constitutes the selected words. By doing this, we
guarantee that the selected words cover at least 10% of the words in each
document. The k-NN classification accuracy of the training documents using only
these selected words is calculated. This accuracy is compared to the baseline
classification accuracy that is computed by using all the words. If the ratio of
the two is less than a user specified minimum ratio (MinRatio), the selection
process is repeated with the minimum coverage (MinCov) incremented by 0.05 (5%).
The selection mechanism stops as soon as the classification power of the
selected words is at least MinRatio of the classification power of all the
words. MinRatio thus controls the total number of words selected.



SelectWord(D, MinRatio)
  1. Rank words in training documents D according to mutual information.
  2. For each d ∈ D, sort words in d according to the global word ranking.
  3. BaseAcc = Accuracy(D, W)
  4. AccRatio = 0.0; MinCov = 0.1; Selected = {};
  5. While (AccRatio < MinRatio)
     5.1 For each d ∈ D
           Selected = Selected ∪ {top (MinCov * 100.0) % of words in d}
     5.2 Acc = Accuracy(D, Selected)
     5.3 AccRatio = Acc / BaseAcc
     5.4 MinCov = MinCov + 0.05
  6. Return Selected



Fig. 3. Major steps of word selection scheme. 
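
The word selection scheme can be sketched in Python as follows. This is a
sketch under assumptions rather than the authors' code: the k-NN accuracy
evaluation is passed in as a callable `accuracy` (hypothetical; called with
`None` it should return the baseline accuracy over all words), the ranking is a
precomputed dictionary of scores, and a stop at 100% coverage is added here to
guarantee termination. The lower bound of 500 selected words mentioned in the
text is not included.

```python
import math


def select_words(docs, ranking, accuracy, min_ratio, step=0.05):
    """Incrementally select top-ranked words until accuracy is good enough.

    docs:      list of lists of words (training documents).
    ranking:   dict word -> score, higher means more important.
    accuracy:  callable; accuracy(None) is the baseline using all words,
               accuracy(selected) uses only the given set of words.
    min_ratio: required ratio of restricted accuracy to baseline accuracy.
    """
    # Sort each document's distinct words by the global ranking, best first.
    ranked_docs = [sorted(set(d), key=lambda t: ranking[t], reverse=True)
                   for d in docs]
    base_acc = accuracy(None)
    min_cov, selected = 0.1, set()
    while True:
        for words in ranked_docs:
            # round() guards against floating-point drift such as 1.5000000002.
            top = math.ceil(round(min_cov * len(words), 9))
            selected.update(words[:top])
        ratio = accuracy(selected) / base_acc if base_acc > 0 else 1.0
        if ratio >= min_ratio or min_cov >= 1.0:
            return selected
        min_cov = min(1.0, min_cov + step)
```

Passing the accuracy evaluation as a parameter keeps the sketch independent of
any particular k-NN implementation.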



The proposed scheme guarantees that distinguishing or important words are 
selected, all the documents have minimum coverage, and selected words have 
classification capacity of at least MinRatio of the full set of words. However, 
when the classification using all the words is very poor, this scheme will select 
very few words and stop. We fixed this problem by selecting at least 500 words.
We refer to this optimized version of WAKNN as WAKNN-F.

WAKNN-C. One obvious choice for reducing the $O(n^2)$ part of the overall
computational cost is sampling the training data. However, sampling in the
presence of large variations of class size is troublesome. Furthermore, this
might mitigate the advantage of k-NN classification. k-NN performs well when
there is a multi-modal class distribution in which a summarization of the
whole class does not accurately depict the class distribution. In such a
multi-modal distribution, k-NN works well because each data point can find a
small set of neighbors that are in the same class and are very close. Random
sampling could thin out these neighbors such that each data point can be pulled
toward other classes in k-NN classification.

We have chosen to cluster training documents. We cluster similar documents
within each class and represent each cluster by a centroid. Then in the
optimization function, we find the k-nearest centroids instead of the k-nearest
neighbors. If we have c clusters, then we will have a computational complexity
of $O(cn)$ instead of $O(n^2)$. Assuming that we can find $c \ll n$, the
computational cost would decrease dramatically.
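
As a rough worked illustration of this saving, take hypothetical sizes (not
taken from the experiments) of n = 1000 training documents and c = 100
clusters. One evaluation of the optimization function then needs on the order
of

$$n^2 = 1000^2 = 10^6 \quad\text{versus}\quad cn = 100 \times 1000 = 10^5$$

similarity computations, so the cost drops by a factor of

$$\frac{n^2}{cn} = \frac{n}{c} = 10.$$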

In the top down clustering of documents within each class, we start with each 
class being a single cluster. We select the next cluster to divide based on the 
recall of each cluster. The recall of each cluster is defined as follows. For each 
cluster, we first find training samples (regardless of their class labels) that have 
the centroid of this cluster as the closest centroid. The recall for the cluster is 
defined as the percentage of these training samples that have the same class as 
the cluster. 
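
The recall of each cluster, as defined above, can be computed along the
following lines. This sketch makes simplifying assumptions: documents are dense
vectors, the plain (unweighted) cosine is used in place of the weighted cosine
measure, and the names `cosine`, `centroid` and `cluster_recalls` are
hypothetical.

```python
import math


def cosine(x, y):
    """Plain cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_recalls(samples, labels, clusters):
    """Recall of each cluster.

    clusters: list of (class_label, member_indices) pairs.
    For each cluster, look at all training samples (of any class) whose
    closest centroid is this cluster's centroid; the recall is the fraction
    of them that share the cluster's class. None if no sample is attracted.
    """
    centroids = [centroid([samples[i] for i in members])
                 for _, members in clusters]
    hits = [0] * len(clusters)
    attracted = [0] * len(clusters)
    for x, label in zip(samples, labels):
        nearest = max(range(len(clusters)),
                      key=lambda c: cosine(x, centroids[c]))
        attracted[nearest] += 1
        hits[nearest] += int(clusters[nearest][0] == label)
    return [h / a if a else None for h, a in zip(hits, attracted)]
```

The cluster with the lowest recall is the next candidate for splitting, since
its centroid attracts the largest share of samples from other classes.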

We pick the cluster with the worst recall and divide this cluster into two
clusters, and then perform a refinement similar to the k-means clustering
algorithm, with the weighted cosine similarity measure used for assigning
documents to clusters. Note that the refinement of clusters is performed among
clusters of the same class. This approach gives a natural stopping point: we
stop clustering when all the clusters have perfect recall. We augment this
stopping criterion by forcing each class to have at least k clusters.

As the weight adjustment progresses, the weight changes can cause the clus- 
ter to change. After each weight change, we refine the clusters with new weights 
and repeat the recall-based clustering process to enforce that each cluster has 
a perfect recall. In general, the number of clusters increases as the weight ad- 
justment progresses. It is entirely possible that better weights can allow clusters 
to be merged and yet have a perfect recall. In the current implementation this 
possibility has not been explored. We refer to this optimization of WAKNN-F as
WAKNN-C.

The major steps of WAKNN-C are shown in Figure 4. In steps 1, 2 and 3, the
training document matrix is constructed and the matrix is normalized to mitigate
the effect of different document lengths. In step 8.3, the weight for each word
is perturbed slightly and its effect on the optimization function is evaluated. 
The perturbation of the weight is proposed by multiplying the current weight 
by different factors. The weight perturbation with the most improvement in the 
optimization function is chosen as the winner and its weight is updated. This 
process is repeated until no further progress is obtained. 
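
The greedy perturbation loop described above can be sketched generically as
follows. The objective is passed in as a callable so that the sketch stays
independent of how the optimization function is evaluated; the function name
`hill_climb` and its signature are assumptions of this sketch, and cluster
refinement after each accepted change is left out.

```python
def hill_climb(weights, factors, objective):
    """Greedy search over multiplicative weight perturbations.

    weights:   initial weight vector (list of floats).
    factors:   multiplication factors tried for every weight.
    objective: callable mapping a weight vector to a score to maximize.
    Returns the final weight vector and its score.
    """
    weights = list(weights)
    best = objective(weights)
    while True:
        best_move = None
        for j in range(len(weights)):
            original = weights[j]
            for f in factors:
                weights[j] = original * f       # try one perturbation
                score = objective(weights)
                if score > best:                # strict improvement only
                    best, best_move = score, (j, original * f)
            weights[j] = original               # undo before trying the next word
        if best_move is None:
            return weights, best                # no perturbation helps: stop
        j, value = best_move
        weights[j] = value                      # commit the single best change
```

Only one weight changes per iteration, which matches the description of
choosing a single winning perturbation; every iteration costs one objective
evaluation per word and factor.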

The differences between WAKNN-F and WAKNN are steps 5 and 8.3. Instead
of checking all the words for weight changes, WAKNN-F only checks the words
selected by the SelectWord function. The differences between WAKNN-C and
WAKNN-F are step 6 in the algorithm, which computes the initial clusters, and
step 8.5, which refines the clusters after a weight change. Another difference
is the definition of neighbors in the computation of the similarity sum in
Eq. (1). Now $N_d$ is the set of k-nearest clusters of d, which is determined by
computing the similarity of d to the centroids of the clusters. $S_c$ is
computed based on the similarities between d and the centroids of the k-nearest
clusters.

  1. Construct a training matrix D.
     - each row corresponds to a training document
     - each column corresponds to a word
     - the value in row i, column j is the frequency of word j in document i
  2. Let F be a set of multiplication factors and W be a weight vector.
  3. Normalize word frequencies in each document such that they add up to 1.0.
  4. For each j, W_j = 1.0
  5. Selected = SelectWord(D)
  6. C = FindClusters(D, W)
  7. bestopt = OPT(D, W, C); oldopt = 0;
  8. While (bestopt > oldopt)
     8.1 oldopt = bestopt
     8.2 bestword = -1; bestval = -1
     8.3 For each word j ∈ Selected
           For each f ∈ F
             W' = W
             W'_j = W_j × f
             If (OPT(D, W', C) > bestopt)
               bestword = j
               bestval = W'_j
               bestopt = OPT(D, W', C)
     8.4 W_bestword = bestval
     8.5 RefineClusters(D, C, W)

Fig. 4. Major steps of WAKNN-C.
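
Steps 1 and 3 of the algorithm in Fig. 4 (building the training matrix and
normalizing each row so that it sums to 1.0) can be sketched as follows. The
sketch assumes documents are already tokenized into lists of words and uses a
dense list-of-lists matrix for clarity; the function name
`build_normalized_matrix` is hypothetical.

```python
from collections import Counter


def build_normalized_matrix(tokenized_docs):
    """Build a document-by-word matrix of normalized word frequencies.

    tokenized_docs: list of lists of words.
    Returns (vocab, matrix), where matrix[i][j] is the frequency of word
    vocab[j] in document i divided by the length of document i.
    """
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = []
    for doc in tokenized_docs:
        row = [0.0] * len(vocab)
        for t, count in Counter(doc).items():
            row[index[t]] = count / len(doc)  # each non-empty row sums to 1.0
        matrix.append(row)
    return vocab, matrix
```

A real text collection would normally call for a sparse representation, since
most entries of such a matrix are zero.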

4 Experimental Results 

In all of the data sets, we have used a stop list to remove common words and
stemmed the remaining words using Porter’s suffix-stripping algorithm. For the
data representation, we have followed the vector space model commonly used in
Information Retrieval systems.

We have used 52 data sets in total for the experiments. More detailed
information on these data sets is available elsewhere, and the data sets
themselves are available from http://www.cs.umn.edu/~han/data/tmdata.tar.gz.
Data sets west1, west2, . . . , west7 are from the statutory collections of the
legal document publishing division of West Group. Data sets tr11, tr12, . . . ,
tr45, fbis, and trec6 are derived from the TREC-5 [25], TREC-6, and TREC-7
collections. The classes of the various TREC data sets were generated from the
relevance judgments provided in these collections. Data sets oh0, oh1, . . . ,
oh19 are from the OHSUMED collection, a subset of the MEDLINE database. We took
different subsets of categories to construct these data sets. Data sets re0 and
re1 are from the Reuters-21578 text categorization test collection Distribution
1.0. We removed dominant classes such as “earn” and “acq” that have been shown
to be relatively easy to classify. We then divided the remaining classes into 2
sets. Data set wap is from the WebACE project (WAP). Each document corresponds
to a web page listed in the subject hierarchy of Yahoo!.

We compare WAKNN against C4.5, RIPPER, Rainbow, k-NN, PEBLS, k-NN
with mutual information (MI), RELIEF-F, and VSM. We have not included
SVM in the comparison, because the SVM code available was designed for
the two class problem. We implemented k-NN, MI, and RELIEF-F; we used publicly
available implementations of C4.5, RIPPER, Rainbow, PEBLS, and VSM. In
all the k-NN based algorithms (WAKNN-C, k-NN, MI, RELIEF-F, and VSM)
except PEBLS, the number of neighbors k = 10 is used. For PEBLS, we used the
nearest neighbor (i.e., k = 1), because the results using more than one neighbor
were significantly worse. For Rainbow, we used the multinomial event model as
the document representation model, because results from other works [16, 24]
indicate that this event model is better than the multi-variate Bernoulli event
model and our preliminary experiments also confirmed this claim. For other
parameters in these algorithms, we used the default parameter values suggested by
the developers of the algorithms.

In the first experiment, we compared the classification accuracy of WAKNN
to other classifiers on a subset of the data sets described above. In these
experiments, we used p = 0.5 in the optimization function and {0.2, 0.5, 0.8,
1.5, 2.0, 4.0, 9.0, 15.0, 30.0, 50.0} as multiplication factors for weight
perturbation. Out of 19 data sets, WAKNN has the best classification accuracy on
13 data sets. None of the other classifiers has the best classification accuracy
on more than one data set. Even on the 6 data sets where WAKNN did worse than
some other classification algorithm, the classification accuracy of WAKNN was
within 2% of the best result.

Even though WAKNN provides better classification accuracies than the other
classifiers, its computational cost is very high. The runtime of WAKNN varied
from a few hours on data sets with 300 to 500 training samples (west1 and tr11)
to a few days on data sets with 1000 training samples (oh8, oh12, and oh18).
WAKNN-F and WAKNN-C significantly reduced the runtime of WAKNN. For
instance, WAKNN-F reduced the runtime of WAKNN by a factor of 2 to 6.
WAKNN-C further reduced the runtime dramatically on the large data sets (oh8,
oh12, and oh18) compared to WAKNN-F. The runtime of WAKNN-C on these data
sets ranged from 19 to 25 minutes, whereas the runtime of WAKNN-F on the
same data sets ranged from 5 to 8 hours. While WAKNN-F and WAKNN-C reduced the
runtime of WAKNN significantly, they have statistically equal classification
quality according to the sign test and the two sample significance test.

Table 1. Classification accuracies (%) of different classifiers. Note that the highest
accuracy for each data set is marked with an asterisk (*).

Data set     C4.5    RIPPER   Rainbow       kNN     PEBLS        MI  RELIEF-F       VSM   WAKNN-C
west1       82.40     84.87     84.40     76.73     78.50     86.80     76.87     85.27     87.40*
west2       74.20     72.22     72.11     68.33     67.80     75.89     68.44     76.11     81.78*
west3       75.40     78.48     79.92     67.83     67.00     79.71     68.03     79.92     85.86*
west4       80.50     73.88     79.61     67.26     70.70     74.96     67.80     75.49     83.54*
west5       84.20     80.35     88.73     77.78     83.40     90.02     78.10     87.12     94.04*
west6       76.60     78.14     85.66     72.27     74.30     82.92     72.40     85.11     86.89*
west7       79.90     76.05     74.35     67.46     68.40     78.08     67.68     78.64     81.69*
tr11        79.70     79.23     85.99*    85.02     81.20     85.51     85.02     83.58     85.99*
tr12        86.00     82.17     80.25     82.17     79.60     86.62     82.17     87.90*    87.90*
tr13        88.90     88.36     87.69     93.10*    87.70     92.02     93.10*    92.56     92.69
tr14        88.90     90.40     91.64     92.88     79.30     93.19     92.88     90.71     95.05*
tr15        94.90     93.57     97.75     94.86     94.20     97.43     94.86     98.07*    97.43
tr21        80.50     82.25     60.95     81.66     75.70     80.47     81.66     80.47     90.53*
tr22        88.60     87.40     91.06     93.09     84.60     93.50*    93.09     91.06     92.28
tr23        93.20*    93.20*    73.79     82.52     79.60     89.32     82.52     76.70     85.44
tr24        92.50     85.63     73.12     88.75     90.00     86.25     88.75     86.25     96.25*
tr25        83.40     91.72*    85.80     80.47     72.80     81.07     80.47     84.02     88.76
tr31        90.50     87.93     92.46     91.59     86.90     95.69*    91.59     92.46     94.61
tr32        86.80     83.72     78.68     75.97     71.70     75.58     75.97     80.62     90.31*
tr33        89.50     90.83     67.69     86.46     86.50     90.39     86.46     89.52     95.63*
tr34        94.00*    87.99     77.74     88.34     85.20     85.51     88.34     86.57     93.99
tr35        87.90     90.00     87.50     88.93     78.90     91.79     88.93     90.00     96.07*
tr41        90.70     95.67*    94.08     89.07     86.30     89.07     89.07     92.71     94.99
tr42        89.70     92.31     87.91     85.35     79.50     87.55     85.35     91.58     94.87*
tr43        95.30*    89.67     88.73     83.57     83.60     84.04     83.57     88.73     92.96
tr44        86.60     79.77     89.31*    83.97     80.50     82.06     83.97     80.92     85.88
tr45        87.90     83.53     84.68     86.99     70.80     85.55     87.28     88.73     96.53*
fbis        57.10     74.84     76.38     78.49     69.80     76.54     78.49     74.67     83.52*
trec6       67.50     82.79     92.16     91.99     84.30     88.42     91.99     87.56     94.21*
oh0         84.70     84.06     90.21     87.08     55.60     85.73     86.98     77.19     90.73*
oh1         81.20     79.82     84.37     86.03     37.50     81.93     86.03     82.48     87.36*
oh2         88.90     85.31     91.57     90.57     76.10     87.49     90.48     86.40     92.38*
oh3         85.60     81.58     88.69     86.03     55.10     87.40     86.03     84.83     91.35*
oh4         82.00     86.33     92.07     90.52     58.20     86.87     90.43     82.04     93.53*
oh5         76.70     72.73     83.83     84.06     53.20     81.94     84.06     80.87     88.78*
oh6         83.90     84.43     88.58     90.71*    53.30     82.64     90.59     81.86     89.92
oh7         77.90     76.33     87.46     85.80     34.40     81.30     85.56     73.49     87.81*
oh8         76.90     71.60     85.44     82.40     55.00     80.95     82.52     79.85     87.99*
oh9         84.50     76.90     90.48     88.81     56.10     83.10     88.81     83.33     91.19*
oh10        74.00     70.36     79.69     77.26     39.90     76.09     77.26     69.00     84.55*
oh11        83.30     84.32     82.81     80.93     65.80     85.74     80.93     82.91     91.31*
oh12        82.80     81.24     86.63     85.17     73.90     80.45     85.28     81.46     88.88*
oh13        80.00     80.55     91.97     89.64     63.50     86.36     89.75     86.05     92.92*
oh14        82.10     79.79     88.51     86.88     43.50     86.88     86.97     83.53     90.13*
oh15        73.00     76.67     82.81     80.25     49.80     75.11     80.13     77.12     84.26*
oh16        87.70     86.63     92.15     90.60     74.80     90.34     90.68     87.49     93.70*
oh17        85.20     80.16     90.72     89.04     56.30     86.77     89.04     85.59     92.20*
oh18        78.60     77.63     82.44     78.30     56.60     76.17     78.41     77.63     86.02*
oh19        80.60     77.49     86.81     85.70     42.90     80.82     85.70     81.49     87.25*
re0         66.90     67.27     76.45     81.04     67.90     79.44     81.44*    81.04     78.44
re1         78.60     77.27     77.73     74.35     67.00     76.19     74.35     80.49     84.49*
wap         71.30     72.95     83.97     81.92     41.80     79.23     81.92     68.08     84.23*

We now present the classification accuracy of WAKNN-C compared to other
classifiers on all 52 data sets in Table 1. We used MinRatio = 0.75 for selecting
words in WAKNN-C. The number of words selected ranged from one half to one
tenth of the original number of words in the data sets. The minimum coverage
of each document with selected words ranged from 10% to 85%. The number
of clusters found ranged from one half to one tenth of the original number of
training documents. Detailed parameter studies of WAKNN-C are available
elsewhere. Out of 52 data sets, WAKNN-C has the best classification accuracy on
38 data sets. None of the other classifiers has the best classification accuracy
on more than three data sets. These results show that WAKNN-C outperforms the
other classifiers on most of the data sets. In order to verify whether
these differences are statistically significant, we performed the sign test
and the two sample significance test. The two sample significance test
shows that WAKNN-C is not statistically worse than any of the classifiers on
any data set. The results also show that WAKNN-C is statistically better
(P-value < 0.05) than C4.5, RIPPER, Rainbow, k-NN, PEBLS, MI, RELIEF-F,
and VSM in 43, 39, 22, 34, 49, 41, 34 and 35 data sets respectively. According
to the sign test, WAKNN-C is statistically better (with P-value < 0.05) than all
the other classifiers.
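
For readers unfamiliar with the sign test mentioned above, the following is a
minimal sketch of a one-sided version. The text does not state which variant
was used, so the one-sided form, the function name `sign_test_p_value`, and the
counts in the usage note are assumptions for illustration only. Data sets on
which two classifiers tie are assumed to be dropped before counting.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-sided sign test p-value.

    Returns the probability of observing at least `wins` successes in
    wins + losses fair coin flips, i.e. under the null hypothesis that
    neither classifier is better than the other.
    """
    n = wins + losses
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n
```

With hypothetical counts of 9 wins and 1 loss, `sign_test_p_value(9, 1)`
evaluates to 11/1024, which is below the 0.05 threshold, whereas 5 wins and 5
losses give 638/1024.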

5 Conclusions and Directions for Future Research 

In this paper, we presented a Weight Adjusted k-Nearest Neighbor (WAKNN)
classification that retains the power of k-NN while further enhancing its
ability by learning feature weights. Experimental results show that WAKNN-C
outperforms existing state of the art classification algorithms quite consistently
on document data sets from a variety of domains. Even though the focus of
this paper has been the application of WAKNN to the text categorization task,
WAKNN is applicable in any domain with a large number of attributes.

Several issues that we identified in the course of this work deserve further
attention. In the greedy search technique used in WAKNN, only one weight
change is selected at a time. It will be rewarding to explore methods for changing
multiple words at a time in the greedy search process. One possibility is to use
association rules to identify sets of words that are strongly related, and then
use this knowledge in determining which sets of words to select for simultaneous
weight adjustments. Another research focus is incremental weight adjustment
as new training samples become available.
Abstract. This paper introduces a class of predictive self-organizing
neural networks known as Adaptive Resonance Associative Map
(ARAM) for classification of free-text documents. Whereas most
statistical approaches to text categorization derive classification knowledge
based on training examples alone, ARAM performs supervised learning
and integrates user-defined classification knowledge in the form of
IF-THEN rules. Through our experiments on the Reuters-21578 news
database, we showed that ARAM performed reasonably well in mining
categorization knowledge from sparse and high dimensional document
feature space. In addition, ARAM's predictive accuracy and learning
efficiency can be improved by incorporating a set of rules derived from
the Reuters category description. The impact of rule insertion is most
significant for categories with a small number of relevant documents.



1 Introduction 

Text categorization refers to the task of automatically assigning documents to
one or more predefined classes or categories. It can be considered the simplest
form of text mining in the sense that it abstracts the key content of a free-text
document into a single class label. In recent years, an increasing number of
statistical and machine learning techniques have been developed that
automatically generate text categorization knowledge based on training examples.
Such techniques include decision trees [?], K-nearest-neighbor systems (KNN)
[?], rule induction [?], gradient descent neural networks [?], regression
models [?], Linear Least Square Fit (LLSF) [?], and support vector machines
(SVM) [?]. All these statistical methods adopt a supervised learning paradigm.
During the learning phase, a classifier derives categorization knowledge from a
set of prelabeled or tagged documents. During the testing phase, the classifier
makes predictions or classifications on a separate set of unseen test cases. The
supervised learning paradigm assumes the availability of a large pre-labeled or
tagged training corpus. In specific domains, such corpora may not be readily
available. In a personalized information filtering application, for example, few
users would have the patience to provide feedback on a large number of documents
for training the classifier. On the other hand, most users are willing to specify
what they want explicitly. In such cases, it is desirable to have the flexibility
of building a text classifier from examples as well as obtaining categorization
knowledge directly from the users.

In the machine learning literature, hybrid models have been studied to integrate
multiple knowledge sources for pattern classification. For example, Knowledge
Based Artificial Neural Network (KBANN) refines imperfect domain knowledge
using backpropagation neural networks [?]; predictive self-organizing neural
networks [?] allow rule insertion at any point of the incremental learning
process. Benchmark studies on several databases have shown that initializing
such hybrid learning systems with prior knowledge not only improves predictive
accuracy, but also produces better learning efficiency, in terms of the learning
time as well as the final size of the classifiers [?]. In addition, promising
results have been obtained by applying KBANN to build intelligent agents for web
page classification [?].

This paper reports our evaluation of a class of predictive self-organizing
neural networks known as Adaptive Resonance Associative Map (ARAM) [?] for
text classification based on a popular public domain document database, namely 
Reuters-21578. The objectives of our experiments are twofold. First, we study 
ARAM’s capability in mining categorization rules from sparse and high dimen- 
sional document feature vectors. Second, we investigate if ARAM’s predictive 
accuracy and learning efficiency can be enhanced by incorporating a set of rules 
derived from the Reuters category description. 

The rest of this article is organized as follows. Section 2 describes our choice of 
feature selection and extraction methods. Section 3 presents the ARAM learning, 
classification, and rule insertion algorithms. Section 4 reports the experimental 
results. The final section summarizes and concludes. 



2 Feature Selection/Extraction

As in statistical text categorization systems, we adopt a bag-of-words approach
to representing documents in the sense that each document is represented by a
set of keyword features. The keyword features can be obtained from two sources.
Through rule insertion, a keyword feature can be specified explicitly by a user
as an antecedent in a rule. Features can also be selected from the words in the
training documents based on a certain feature ranking metric. Some popularly
used measures for feature ranking include keyword frequency, $\chi^2$
statistics, and information gain. In our experiments, we only use $\chi^2$
statistics, which has been reported to be one of the most effective measures [?].

During rule insertion and keyword selection, we use an in-house morpholog- 
ical analyzer to identify the part-of-speech and the root form of each word. To 
reduce complexity, only the root forms of the noun and verb terms are extracted 
for further processing. 

During keyword extraction, the document is first segmented and converted
into a keyword feature vector

$$v = (v_1, v_2, \ldots, v_M) \tag{1}$$

where M is the number of keyword features selected. We experiment with three
different document representation schemes described below.

tf encoding: This is the simplest and the first method that we used in earlier
experiments in [?]. The feature vector v simply equals the term frequency vector
tf such that the value of feature j

$$v_j = tf_j \tag{2}$$

where $tf_j$ is the in-document frequency of the keyword $w_j$.

tf*idf encoding: A term weighting method based on inverse document frequency
[?] is combined with the term frequency to produce the feature vector v such
that

$$v_j = tf_j \log_2 \frac{N}{df_j} \tag{3}$$

where N is the total number of documents in the collection and $df_j$ is the
number of documents containing the keyword $w_j$.

log-tf*idf encoding: This is a variant of the tf*idf scheme. The feature vector
v is computed by

$$v_j = (1 + \log_2 tf_j) \log_2 \frac{N}{df_j}. \tag{4}$$

After encoding using one of the three feature representation schemes, the
feature vector v is normalized to produce the final feature vector

$$a = v / v_m \quad \text{where } v_m \ge v_i \ \forall i \tag{5}$$

before presentation to the neural network classifier.
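
The three encodings and the final normalization can be illustrated with a short
Python sketch. It is not the authors' implementation: the function name
`encode`, its argument names, and the handling of keywords with zero term
frequency (mapped to 0 because $\log_2 0$ is undefined) are assumptions made for
illustration only.

```python
import math


def encode(tf, df, n_docs, scheme="log-tf*idf"):
    """Encode one document and normalise by its largest entry (Equations 2-5).

    tf     -- in-document frequency of each selected keyword
    df     -- number of documents containing each keyword (assumed >= 1)
    n_docs -- total number of documents in the collection (N)
    """
    v = []
    for tf_j, df_j in zip(tf, df):
        if scheme == "tf":
            v_j = float(tf_j)                                # Equation (2)
        elif scheme == "tf*idf":
            v_j = tf_j * math.log2(n_docs / df_j)            # Equation (3)
        elif scheme == "log-tf*idf":
            # Assumption: a keyword absent from the document (tf_j = 0)
            # gets weight 0, because log2(0) is undefined.
            if tf_j > 0:
                v_j = (1 + math.log2(tf_j)) * math.log2(n_docs / df_j)  # Eq. (4)
            else:
                v_j = 0.0
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        v.append(v_j)
    v_max = max(v, default=0.0)
    # Assumption: an all-zero (null) vector is returned unchanged.
    return [x / v_max for x in v] if v_max > 0 else v   # Equation (5)
```

Dividing by the largest component keeps every entry of a in [0, 1], which is
what the complement coding step of the network expects.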

3 ARAM Algorithms 

ARAM belongs to a family of predictive self-organizing neural networks known
as predictive Adaptive Resonance Theory (predictive ART) that performs
incremental supervised learning of recognition categories (pattern classes) and
multidimensional maps of patterns. An ARAM system can be visualized as two
overlapping Adaptive Resonance Theory (ART) [?] modules consisting of two
input fields $F_1^a$ and $F_1^b$ with an $F_2$ category field. For
classification problems, the $F_1^a$ field serves as the input field containing
the input activity vector and the $F_1^b$ field serves as the output field
containing the output class vector. The $F_2$ field contains the activities of
the recognition categories that are used to encode the patterns.

In an ARAM network (Figure 1), the unit for recruiting an $F_2$ category node
is a complete pattern pair. Given a pair of input patterns, the category field
$F_2$ selects a winner that receives the largest overall input from the feature
fields $F_1^a$ and $F_1^b$. The winning node selected in $F_2$ then triggers a
top-down priming on $F_1^a$ and $F_1^b$, monitored by separate reset mechanisms.
Code stabilization is ensured by restricting encoding to states where resonances
are reached in both modules. By synchronizing the unsupervised categorization of
two pattern sets, ARAM learns a supervised mapping between the pattern sets. Due
to the code stabilization mechanism, fast learning in a real-time environment is
feasible.

In addition, the knowledge that ARAM discovers during learning is compatible
with IF-THEN rule-based representation. Specifically, each node in the $F_2$
field represents a recognition category associating the $F_1^a$ input patterns
with the $F_1^b$ input vectors. Learned weight vectors, one for each $F_2$ node,
constitute a set of rules that link antecedents to consequents. At any point
during the incremental learning process, the system architecture can be
translated into a compact set of rules. Similarly, domain knowledge in the form
of IF-THEN rules can be inserted into the ARAM architecture.

3.1 Learning 

The ART modules used in ARAM can be ART 1 [?], which categorizes binary
patterns, or analog ART modules such as ART 2, ART 2-A, and fuzzy ART
[?], which categorize both binary and analog patterns. The fuzzy ARAM model,
which is composed of two overlapping fuzzy ART modules (Figure 1), is described
below.




Fig. 1. The Adaptive Resonance Associative Map architecture. 

Input vectors: Normalization of fuzzy ART inputs prevents category
proliferation. The $F_1^a$ and $F_1^b$ input vectors are normalized by
complement coding that preserves amplitude information. Complement coding
represents both the on-response and the off-response to an input vector a. The
complement coded $F_1^a$ input vector A is a 2M-dimensional vector

$$A = (a, a^c) = (a_1, \ldots, a_M, a_1^c, \ldots, a_M^c) \tag{6}$$

where $a_i^c = 1 - a_i$. Similarly, the complement coded $F_1^b$ input vector B
is a 2N-dimensional vector

$$B = (b, b^c) = (b_1, \ldots, b_N, b_1^c, \ldots, b_N^c) \tag{7}$$

where $b_i^c = 1 - b_i$.

Activity vectors: Let $x^a$ and $x^b$ denote the $F_1^a$ and $F_1^b$ activity
vectors respectively. Let y denote the $F_2$ activity vector. Upon input
presentation, $x^a = A$ and $x^b = B$.
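
As a small illustration of complement coding, the following Python sketch builds
the 2M-dimensional vector of Equation (6) from a normalized input. The function
name `complement_code` is hypothetical, and the range check is an added
safeguard rather than part of the model description.

```python
def complement_code(a):
    """Return (a, a^c) with a^c_i = 1 - a_i, as in Equation (6)."""
    if any(x < 0.0 or x > 1.0 for x in a):
        raise ValueError("inputs must be normalised to [0, 1]")
    return list(a) + [1.0 - x for x in a]
```

Whatever the input, the entries of the coded vector sum to M, the length of a.
This constant norm is what keeps small-valued inputs from being absorbed by
ever-shrinking weight templates.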

Weight vectors: Each $F_2$ category node j is associated with two adaptive
weight templates $w_j^a$ and $w_j^b$. Initially, all category nodes are
uncommitted and all weights equal one. After a category node is selected for
encoding, it becomes committed.

Parameters: Fuzzy ARAM dynamics are determined by the choice parameters
$\alpha_a > 0$ and $\alpha_b > 0$; the learning rates $\beta_a \in [0, 1]$ and
$\beta_b \in [0, 1]$; the vigilance parameters $\rho_a \in [0, 1]$ and
$\rho_b \in [0, 1]$; and the contribution parameter $\gamma \in [0, 1]$.

Category choice: Given a pair of $F_1^a$ and $F_1^b$ input vectors A and B, for
each $F_2$ node j, the choice function $T_j$ is defined by

$$T_j = \gamma \frac{|A \wedge w_j^a|}{\alpha_a + |w_j^a|} + (1 - \gamma) \frac{|B \wedge w_j^b|}{\alpha_b + |w_j^b|}, \tag{8}$$

where the fuzzy AND operation $\wedge$ is defined by

$$(p \wedge q)_i \equiv \min(p_i, q_i), \tag{9}$$

and where the norm $|\cdot|$ is defined by

$$|p| \equiv \sum_i p_i \tag{10}$$

for vectors p and q.
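
The choice function can be written directly from Equations (8) to (10). The
Python sketch below is illustrative only; the function names `fuzzy_and`,
`norm`, and `choice` are hypothetical and vectors are represented as plain
lists.

```python
def fuzzy_and(p, q):
    """Component-wise minimum, Equation (9)."""
    return [min(p_i, q_i) for p_i, q_i in zip(p, q)]


def norm(p):
    """Sum of the components, Equation (10)."""
    return sum(p)


def choice(A, B, w_a, w_b, alpha_a, alpha_b, gamma):
    """Choice function T_j of one F2 node with templates w_a and w_b, Eq. (8)."""
    t_a = norm(fuzzy_and(A, w_a)) / (alpha_a + norm(w_a))
    t_b = norm(fuzzy_and(B, w_b)) / (alpha_b + norm(w_b))
    return gamma * t_a + (1.0 - gamma) * t_b
```

With the contribution parameter set to 1 only the first term remains, so the
winner is decided by the input side alone.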

The system is said to make a choice when at most one $F_2$ node can become
active. The choice is indexed at J where

$$T_J = \max\{T_j : \text{for all } F_2 \text{ node } j\}. \tag{11}$$

When a category choice is made at node J, $y_J = 1$; and $y_j = 0$ for all
$j \ne J$.

Resonance or reset: Resonance occurs if the match functions, $m_J^a$ and
$m_J^b$, meet the vigilance criteria in their respective modules:

$$m_J^a = \frac{|A \wedge w_J^a|}{|A|} \ge \rho_a \quad \text{and} \quad m_J^b = \frac{|B \wedge w_J^b|}{|B|} \ge \rho_b. \tag{12}$$

Learning then ensues, as defined below. If any of the vigilance constraints is
violated, mismatch reset occurs in which the value of the choice function $T_J$
is set to 0 for the duration of the input presentation. The search process
repeats to select another new index J until resonance is achieved.
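
The choice, match, and reset cycle can be summarized in a short Python sketch.
It is a simplified illustration, not the authors' code: the function name
`search` is hypothetical, reset nodes are tracked in a set instead of having
their choice value overwritten with 0, and `None` is returned when no node
resonates (in a full network an uncommitted node with all-one weights would
always be available and would always resonate).

```python
def fuzzy_and(p, q):
    return [min(p_i, q_i) for p_i, q_i in zip(p, q)]


def norm(p):
    return sum(p)


def search(A, B, W_a, W_b, alpha_a, alpha_b, gamma, rho_a, rho_b):
    """Return the index J of the first resonating F2 node, or None."""
    # Choice function of every node, Equation (8).
    T = [
        gamma * norm(fuzzy_and(A, w_a)) / (alpha_a + norm(w_a))
        + (1.0 - gamma) * norm(fuzzy_and(B, w_b)) / (alpha_b + norm(w_b))
        for w_a, w_b in zip(W_a, W_b)
    ]
    reset = set()
    while len(reset) < len(T):
        # Winner among the nodes not yet reset, Equation (11).
        J = max((j for j in range(len(T)) if j not in reset), key=lambda j: T[j])
        # Match functions of both modules against their vigilance criteria.
        m_a = norm(fuzzy_and(A, W_a[J])) / norm(A)
        m_b = norm(fuzzy_and(B, W_b[J])) / norm(B)
        if m_a >= rho_a and m_b >= rho_b:
            return J
        reset.add(J)  # mismatch reset: exclude J for this presentation
    return None
```

In the example used to check this sketch, a node whose input template matches
perfectly but whose output template disagrees is reset, and the search moves on
to the uncommitted node.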
Learning: Once the search ends, the weight vectors $w_J^a$ and $w_J^b$ are
updated according to the equations

$$w_J^{a(new)} = (1 - \beta_a) w_J^{a(old)} + \beta_a (A \wedge w_J^{a(old)}) \tag{13}$$

and

$$w_J^{b(new)} = (1 - \beta_b) w_J^{b(old)} + \beta_b (B \wedge w_J^{b(old)}) \tag{14}$$

respectively. For efficient coding of noisy input sets, it is useful to set
$\beta_a = \beta_b = 1$ when J is an uncommitted node, and then take
$\beta_a < 1$ and $\beta_b < 1$ after the category node is committed. Fast
learning corresponds to setting $\beta_a = \beta_b = 1$ for committed nodes.
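
The template update is the same in both modules, so a single helper covers
Equations (13) and (14). The Python sketch below is illustrative; the function
name `learn` is hypothetical.

```python
def learn(w_old, X, beta):
    """Move template w_old towards (X AND w_old) at learning rate beta."""
    return [(1.0 - beta) * w + beta * min(x, w) for x, w in zip(X, w_old)]
```

With beta = 1 (fast learning) an uncommitted all-one template simply becomes
the input vector. For any beta in [0, 1] no weight ever increases, which is why
the learned codes stabilize.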

Match tracking: The match tracking rule as used in the ARTMAP search and
prediction process [?] is useful in maximizing code compression. At the start of
each input presentation, the vigilance parameter $\rho_a$ equals a baseline
vigilance $\bar{\rho}_a$. If a reset occurs in the category field $F_2$,
$\rho_a$ is increased until it is slightly larger than the match function
$m_J^a$. The search process then selects another $F_2$ node J under the revised
vigilance criterion. With the match tracking rule and setting the contribution
parameter $\gamma = 1$, ARAM emulates the search and test dynamics of ARTMAP.



3.2 Classification 

In ARAM systems with category choice, only the $F_2$ node J that receives
maximal $F_1^a \to F_2$ input $T_j$ predicts $\mathrm{ART}_b$ output. In
simulations,

$$y_j = \begin{cases} 1 & \text{if } j = J \text{ where } T_J > T_k \text{ for all } k \ne J \\ 0 & \text{otherwise.} \end{cases} \tag{15}$$

The $F_1^b$ activity vector $x^b$ is given by

$$x^b = \sum_j w_j^b y_j = w_J^b. \tag{16}$$

The output prediction vector B is then given by

$$B = (b_1, b_2, \ldots, b_{2N}) = x^b \tag{17}$$

where $b_i$ indicates the confidence of assigning a pattern to category i.
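
Prediction therefore reduces to picking the winning node and reading out its
output template. The Python sketch below assumes that, because no output vector
is available at prediction time, the choice function uses only the input-side
term of the choice function; the function name `predict` is hypothetical, and
ties are broken in favor of the first node, whereas Equation (15) asks for a
strict maximum.

```python
def predict(A, W_a, W_b, alpha_a):
    """Return the output prediction vector B = w_J^b of the winning node J."""
    # Input-side choice value of every node (assumption: output term omitted).
    T = [
        sum(min(a, w) for a, w in zip(A, w_a)) / (alpha_a + sum(w_a))
        for w_a in W_a
    ]
    J = max(range(len(T)), key=T.__getitem__)
    return list(W_b[J])
```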

3.3 Rule Insertion 

ARAM incorporates a class of IF-THEN rules that map a set of input attributes
(antecedents) to a disjoint set of output attributes (consequents). The rules are
conjunctive in the sense that the attributes in the IF clause and in the THEN
clause have an AND relationship. Conjunctive rules have limited expressive power
but are intuitive and adequate for representing simple heuristics for
categorization.
ARAM rule insertion proceeds in two phases. The first phase parses the
rules for keyword features. When a new keyword is encountered, it is added
to a keyword feature table containing keywords obtained through automatic
feature selection from training documents. Based on the keyword feature table,
the second phase of rule insertion translates each rule into a 2M-dimensional
vector A and a 2N-dimensional vector B, where M is the total number of features
in the keyword feature table and N is the number of categories. Given a rule of
the following format,

IF $x_1, x_2, \ldots, x_m, \neg y_1, \neg y_2, \ldots, \neg y_n$ THEN $z_1, z_2, \ldots, z_p$

where $x_1, \ldots, x_m$ are the positive antecedents, $y_1, \ldots, y_n$ are
the negative antecedents, and $z_1, \ldots, z_p$ are the consequents, the
algorithm derives a pair of vectors A and B such that for each index
$i = 1, \ldots, M$,

$$(a_i, a_i^c) = \begin{cases} (1, 0) & \text{if } w_i = x_j \text{ for some } j \in \{1, \ldots, m\} \\ (0, 1) & \text{if } w_i = y_j \text{ for some } j \in \{1, \ldots, n\} \\ (1, 1) & \text{otherwise} \end{cases} \tag{18}$$

where $w_i$ is the i-th entry in the keyword feature table; and for each index
$i = 1, \ldots, N$,

$$(b_i, b_i^c) = \begin{cases} \ldots & \text{if } w_i = z_j \text{ for some } j \in \{1, \ldots, p\} \\ \ldots & \text{otherwise} \end{cases} \tag{19}$$

where $w_i$ is the class label of the category i.

The vector pairs derived from the rules are then used as training patterns
to initialize an ARAM network. During rule insertion, the vigilance parameters
$\rho_a$ and $\rho_b$ are each set to 1 to ensure that only identical attribute
vectors are grouped into one recognition category. Contradictory symbolic rules
are detected during rule insertion when a perfect match in $\mathrm{ART}_a$
($m_J^a = 1$) is associated with a mismatch in $\mathrm{ART}_b$ ($m_J^b < 1$).

4 Experiments: Reuters-21578

Reuters-21578 is chosen as the benchmark domain for a number of reasons. First,
it is reasonably large, consisting of tens of thousands of pre-classified
documents. Second, there is a good mix of large and small categories (in terms
of the number of documents in the category). This enables us to examine ARAM's
learning capability and the effect of rule insertion using different data
characteristics. Last but not least, Reuters-21578 has been studied extensively
in the statistical text categorization literature, allowing us to compare ARAM's
performance with prior art.

To facilitate comparison, we used the recommended ModApte split (Reuters
version 3) [?] to partition the database into training and testing data.
Selecting the 90 (out of a total of 135) categories that contain at least one
training and one testing document yielded 7770 training documents and 3019
testing documents.

4.1 Performance Measures

ARAM experiments adopt the most commonly used performance measures,
namely recall, precision, and the $F_1$ measure. Recall (r) is the percentage of
the documents for a given category (i.e. topic) that are classified correctly.
Precision (p) is the percentage of the predicted documents for a given category
that are classified correctly. It is normal practice to combine recall and
precision in some way, so that classifiers can be compared in terms of a single
rating. Two common ratings are the break-even point and the $F_1$ measure. The
break-even point is the value at which recall equals precision. The $F_1$
measure is defined as

$$F_1(r, p) = \frac{2rp}{r + p}. \tag{20}$$

These scores can be calculated for a series of binary classification
experiments, one for each category, and then averaged across the experiments.
Two types of averaging methods are commonly used: (1) the micro-averaging
technique that gives equal weight to each document; and (2) the macro-averaging
technique that gives equal weight to each category [?]. As micro-averaged $F_1$
scores are computed on a per-document basis, they tend to be dominated by the
classifier's performance on large categories. Macro-averaged $F_1$ scores,
computed on a per-category basis, are more likely to be influenced by the
classifier's performance on small categories.
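
The difference between the two averages is easy to see in code. The Python
sketch below is illustrative only: it assumes each category is summarized by
its true-positive, false-positive, and false-negative counts, and the function
names are hypothetical.

```python
def ratio(num, den):
    """Safe division: an undefined ratio is reported as 0."""
    return num / den if den > 0 else 0.0


def f1(r, p):
    """F1 measure of Equation (20)."""
    return ratio(2 * r * p, r + p)


def micro_macro_f1(counts):
    """counts: one (tp, fp, fn) triple per category; returns (miF1, maF1)."""
    # Micro-averaging pools the decisions of all categories first.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(ratio(tp, tp + fn), ratio(tp, tp + fp))
    # Macro-averaging scores each category separately, then averages.
    per_category = [
        f1(ratio(c_tp, c_tp + c_fn), ratio(c_tp, c_tp + c_fp))
        for c_tp, c_fp, c_fn in counts
    ]
    macro = ratio(sum(per_category), len(per_category))
    return micro, macro
```

With one large, well-classified category and one small, poorly recalled
category, the micro-average stays close to the large category's score while the
macro-average is pulled down by the small one.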

4.2 Learning and Classification 

ARAM experiments used the following parameter values: choice parameters
$\alpha_a = 0.1$, $\alpha_b = 0.1$; learning rates $\beta_a = \beta_b = 1.0$ for
fast learning; contribution parameter $\gamma = 1.0$; and vigilance parameters
$\rho_a = 0.8$, $\rho_b = 1.0$. Using a voting strategy, 10 voting ARAM
networks produced a probabilistic score between 0 and 1. The score was then
thresholded at a specific cut-off point to produce a binary class prediction.

We fixed the number of keyword features at 100, determined empirically
through earlier experiments. Null feature vectors and contradictory feature
vectors were first removed from the training set before training. We
experimented with all three feature encoding schemes, namely tf, tf*idf, and
log-tf*idf. Table 1 summarizes the performance of ARAM averaged across 90
categories of Reuters-21578. Among the three encoding schemes, log-tf*idf
produced the best performance in terms of micro-averaged $F_1$. tf*idf, however,
performed better in terms of macro-averaged $F_1$.

Table 1. Performance of ARAM using the three feature encoding schemes in terms
of micro-averaged recall, micro-averaged precision, micro-averaged F1 and
macro-averaged F1 across all 90 categories of Reuters-21578.

Encoding Method   miR      miP      miF1     maF1
tf                0.8251   0.8376   0.8313   0.5497
tf*idf            0.8368   0.8381   0.8375   0.5691
log-tf*idf        0.8387   0.8439   0.8413   0.5423

4.3 Rule Insertion

A set of IF-THEN rules was crafted based on a description of the
Reuters categories provided in the Reuters-21578 documentation
(cat-descriptions_120396.txt). The rules simply linked the keywords mentioned in
the description to their respective category labels. Creation of such rules was
rather straightforward. A total of 150 rules was created without any help from
domain experts. They are generally short rules containing one to two keywords
extracted from the category description. A partial set of rules is provided in
Table 2 for illustration.

Table 2. An illustrative set of rules generated based on the Reuters category
description.

acq         - acquire acquisition
acq         - merge merger
crude       - crude oil
grain       - grain
interest    - interest
interest    - rate
money-fx    - foreign exchange
money-fx    - money exchange



In the rule insertion experiments, rules were parsed and inserted into the
ARAM networks before learning and classification. Table 3 compares the results
obtained by ARAM (using log-tf*idf) with and without rule insertion on the
10 most populated Reuters categories. The micro-averaged F1 and the
macro-averaged F1 scores across the top 10 and all the 90 categories are also
given. Eight out of the top 10 categories, namely acq, money-fx, grain, crude,
interest, ship, wheat, and corn, showed noticeable improvement in F1 measures by
incorporating rules. Interestingly, one category, namely trade, produced worse
results. No improvement was obtained for earn, the largest category. The overall
improvements in the micro-averaged F1 scores across the top 10 and all the 90
categories were 0.004 and 0.011 respectively. The improvements obtained in the
macro-averaged F1 scores, recorded at 0.006 for the top 10 and 0.055 for the
90 categories, were much more significant. This suggests that rule insertion is
most effective for categories with a smaller number of documents. The results
are encouraging as even a simple set of rules is able to produce a noticeable
improvement.
Table 3. Predictive performance of ARAM with and without rule insertion on
Reuters-21578. Epochs refers to the number of learning iterations for ARAM to
achieve 100% accuracy on the training set. Nodes refers to the number of ARAM
recognition categories created. The last two rows show the micro-averaged F1 and
the macro-averaged F1 across the top 10 and the 90 categories respectively.
Asterisks mark improvements obtained by rule insertion.

            Number of   ARAM                      ARAM w/rules
Category    Documents   Epochs   Nodes   F1       Epochs   Nodes   F1
earn        1087        4.5      717.3   0.984    5.3      722.4   0.984
acq         719         5.1      732.0   0.930    6.1      732.9   0.938*
money-fx    179         4.2      334.4   0.750    5.5      336.0   0.763*
grain       149         4.7      104.5   0.895    5.0      108.3   0.906*
crude       189         6.7      86.4    0.802    6.9      86.7    0.813*
trade       117         6.0      257.2   0.689    6.8      260.9   0.661
interest    131         4.7      281.3   0.727    5.6      287.4   0.740*
ship        89          4.6      46.8    0.793    4.8      48.7    0.796*
wheat       71          4.2      87.7    0.789    8.1      89.0    0.803*
corn        56          3.6      89.6    0.748    8.0      90.2    0.765*

Top 10 (miF1, maF1)     (0.897, 0.811)            (0.901, 0.817)*
All 90 (miF1, maF1)     (0.841, 0.542)            (0.852, 0.597)*



Table 4 compares ARAM results with top performing classification systems
on Reuters-21578 [?]. ARAM performed noticeably better than the gradient
descent neural networks and the Naive Bayes classifiers. Its miF1 scores were
comparable with those of SVM, KNN, and LLSF, but the maF1 scores were
significantly higher. As miF1 scores are predominantly determined by the largest
categories and maF1 scores are dominated by the large number of small
categories, the results indicate that ARAM performs fairly well for large
categories and outperforms the other systems in small categories.



Table 4. Performance of ARAM compared with other top performing text
classification systems across all 90 categories of Reuters-21578.

Classifiers             miR      miP      miF1     maF1
ARAM w/rules            0.8909   0.8155   0.8515   0.5967
ARAM                    0.8961   0.7922   0.8409   0.5422
SVM                     0.8120   0.9147   0.8599   0.5251
KNN                     0.8339   0.8807   0.8567   0.5242
LLSF                    0.8507   0.8489   0.8498   0.5008
Gradient descent NNet   0.7842   0.8785   0.8287   0.3765
Naive Bayes             0.7688   0.8245   0.7956   0.3886

5 Conclusion

This paper has presented a novel approach to incorporating domain knowledge
into a learning text categorization system. ARAM can be considered as a scaled
down version of KNN. Increasing the $\mathrm{ART}_a$ baseline vigilance
parameter $\bar{\rho}_a$ to 1.0 would cause ARAM's performance to converge to
that of KNN at the price of storing all unique training examples. ARAM,
therefore, is more scalable than KNN and is useful in situations when it is not
practical to store all the training examples in memory. Compared with SVM, ARAM
has the advantage of on-line incremental learning in the sense that learning of
new examples does not require re-computation of recognition nodes using
previously learned examples.

The most distinctive feature of ARAM, however, is its rule-based domain 
knowledge integration capability. The performance of ARAM is expected to im- 
prove further as good rules are added. The rule insertion capability is especially 
important when few training examples are available. This suggests that ARAM 
could be suitable for on-line text classification applications such as document 
filtering and personalized content management. 
Abstract. We investigate two meta-model approaches for the task of
automatic textual document categorization. The first approach is the
linear combination approach. Based on the idea of distilling the
characteristics of how we estimate the merits of each component algorithm, we
propose three different strategies for the linear combination approach.
The linear combination approach makes use of limited knowledge in the
training document set. To address this limitation, we propose the second
meta-model approach, called Meta-learning Using Document Feature
characteristics (MUDOF), which employs a meta-learning phase using
document feature characteristics. Document feature characteristics,
derived from the training document set, capture some inherent properties
of a particular category. Extensive experiments have been conducted on a
real-world document collection and satisfactory performance is obtained.

Keywords: Text Categorization, Text Mining, Meta-Learning



1 Introduction 

Textual document categorization aims to assign zero or more appropriate
categories to a document. The goal of automatic text categorization is
to construct a classification scheme, also called the classifier, from a
training set. A training set contains sample documents and their corresponding
categories. Specifically, there is a classification scheme for each category.
During the training phase, documents in the training set are used to learn a
classification scheme for each category by using a learning algorithm. After
the whole training phase is completed, each category will have a different
learned classification scheme, which will then be used to categorize unseen
documents.

There has been some research conducted on automatic text categorization.
Yang and Chute [?] proposed a statistical approach known as Linear Least
Squares Fit (LLSF) which estimates the likelihood of the associations between
document terms and categories via a linear parametric model. Lewis et al. [?]
explored linear classifiers for the text categorization problem. Yang [?]
developed an algorithm known as ExpNet which derives from the k-nearest neighbor
technique. Lam et al. [?] attempted to tackle this problem using Bayesian
networks. Lam and Ho [?] proposed the generalized instance set approach for text
categorization. Joachims [?], as well as Yang and Liu [?], recently compared
support vector machines with k-NN. Dumais et al. [2] compared support vector
machines, decision trees and Bayesian approaches on the Reuters collection. All
the above approaches developed a single paradigm to solve the categorization
problem.

In the literature, several methods for multi-strategy learning or combination
of classifiers have been proposed. Chan and Stolfo [?] presented their
evaluation of simple voting and meta-learning on partitioned data, through
inductive learning. Recently, several meta-model methods have been proposed for
text domains. Yang et al. [?] proposed the Best Overall Results Generator (BORG)
system which combined classification methods linearly for each classifier in the
Topic Detection and Tracking (TDT) domain. Larkey et al. [?] reported improved
performance, by using new query formulation and weighting methods, in the
context of text categorization by combining different classifiers. Hull et al.
[?] examined various combination strategies in the context of document
filtering.

Instead of using only one algorithm, meta-model learning involves more than 
one categorization algorithm. Under the approach, classification schemes that 
have been separately learned by different algorithms for a category, are combined 
together in a certain way, to yield one single meta-model classification scheme. 
Given a document to be categorized, the meta-model classification scheme can 
be used for deciding the document membership for the category. As a result, 
each meta-model classifier for a category is the combined contributions of all the 
involved algorithms. 

All existing meta-model approaches for text categorization are based on lin- 
ear combination of several basic algorithms. In this paper, we investigate the 
linear combination approach by distilling the characteristic of how we estimate 
the relative merit of each component algorithm for different categories. Based 
on this idea, we propose three different strategies for the linear combination ap- 
proach. The linear combination approach makes use of limited knowledge in the 
training document set. To address this limitation, we propose a second meta- 
model approach, called Meta-learning Using Document Feature characteristics 
(MUDOF), which employs a meta-learning phase using document feature charac- 
teristics. Document feature characteristics, derived from the training document 
set, capture some inherent properties of a particular category. This approach 
aims at recommending algorithms automatically for different categories. 

We have conducted extensive experiments on a real-world document collection.
The results demonstrate that our new meta-learning approaches for text
categorization outperform all the component algorithms under various measures.
2 Linear Combination Approach

The first approach we investigate is based on a weighted linear combination of
classifiers. Under this approach, the contribution of each individual
component algorithm j to the final meta-model classification scheme for a
category i is represented by a weight factor $w_{ij}$. Consider a document m
which is to be categorized. Instead of using the score calculated from a single
classification scheme of a particular category, the linear combination approach
calculates a combined score which is the weighted sum of contributions of all
component algorithms in a linear fashion. Suppose there are n component
algorithms. The combined score for m is computed by Equation 1:

$$S_i^m = \sum_{j=1}^{n} w_{ij}^m S_{ij}^m \tag{1}$$

where $S_i^m$ is the final combined score for m for the category i, and
$S_{ij}^m$ is the score calculated between m and the classifier learned by
algorithm j for category i. The value of $w_{ij}^m$ is the weight factor, or the
contribution, of the classifier to the score $S_i^m$, and
$\sum_{j=1}^{n} w_{ij}^m$ equals 1.

If the final combined score for m is larger than the threshold value of a
category, that category is assigned to m. To reflect the significance of
contribution by different classifiers for a category, various methods can be
employed to determine the weight $w_{ij}^m$ in Equation 1. We have implemented
three strategies under this linear combination approach, to study the
categorization performance differences due to the use of different weight
determination for the combination.
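
A minimal Python sketch of the combined score and the thresholded category
decision follows. The function names `combined_score` and `assign` and the
tolerance used to check that the weights sum to 1 are assumptions made for
illustration.

```python
def combined_score(scores, weights):
    """Weighted sum of the component scores for one category and one document."""
    if len(scores) != len(weights):
        raise ValueError("one weight is needed per component algorithm")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, scores))


def assign(scores, weights, threshold):
    """Assign the category when the combined score exceeds its threshold."""
    return combined_score(scores, weights) > threshold
```

For example, component scores of 0.8, 0.4, and 0.6 with weights 0.5, 0.25, and
0.25 give a combined score of 0.65, so the category would be assigned under a
threshold of 0.6 but not under a threshold of 0.7.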

2.1 Equal Weighting Strategy 

The first strategy, called LC1, is an equal weighting scheme. Under this scheme,
the weights of all classifiers are the same, as indicated in Equation 2. As a
result, the contribution of each classification algorithm to the final combined
score for m is equal.

    w_{i1}^m = w_{i2}^m = \cdots = w_{ij}^m = \cdots = w_{in}^m = \frac{1}{n}   for all i        (2)

2.2 Weighting Strategy Based on Utility Measure 

The second strategy, called LC2, determines the weighting scheme based on a
utility measure from training. Under this strategy, the relative contribution,
$w_{ij}^m$, of a classification scheme constructed by algorithm j for category
i to the final combined score for document m depends on the performance,
$u_{ij}$, of the learned classifier in the training phase. The relationship
between the contribution of the classifier and its categorization performance
is represented as a function, as indicated in Equation 3:

    w_{ij}^m = f(u_{ij})        (3)
where the function f is expressed in terms of $u_{ij}$, the utility score
obtained by the classifier.

The function f is a transformation from utility scores to the corresponding
contribution weights. The transformation is restricted by the condition that
$\sum_{j=1}^{n} w_{ij}^m$ equals 1. Conceptually, a classification scheme that
performs well should be given a heavier weight than the others during score
combination. In our investigation, we adopt the function f shown in Equation 4:

    w_{ij}^m = f(u_{ij}) = \frac{u_{ij}}{\sum_{k=1}^{n} u_{ik}}   for 1 \le j \le n        (4)

We make use of a set of documents, called the tuning set, taken from the
training set to calculate $u_{ij}$. Specifically, $u_{ij}$ is the classification
performance on the tuning set of the classification scheme constructed by
algorithm j for category i.
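
The weights of LC1 (Equation 2) and LC2 (Equation 4) can be sketched in a few lines of Python. The function names and the example utility scores are hypothetical; the sketch assumes non-negative utility scores with a positive sum.

```python
def equal_weights(n):
    """LC1: every one of the n component algorithms gets weight 1/n."""
    return [1.0 / n] * n


def utility_weights(utilities):
    """LC2: normalize tuning-set utility scores so the weights sum to 1."""
    total = sum(utilities)
    if total <= 0:
        raise ValueError("utility scores must have a positive sum")
    return [u / total for u in utilities]


# Hypothetical tuning-set utility scores of three classifiers for one category.
print(equal_weights(3))
print(utility_weights([0.8, 0.6, 0.6]))
```

A classifier with a higher utility score on the tuning set receives a proportionally larger share of the combined score.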

2.3 Weighting Strategy Based on Document Rank 

Our third strategy, called LC3, determines the contribution weights of the
component algorithms based on the rank of their scores $S_{ij}^m$ for document
m. The scores are first ranked. By mapping from the rank $R_{ij}^m$ to a set of
pre-determined weight factors using the function g, a particular weight, say
$P_d$, is assigned to the corresponding algorithm as its contribution to the
final combined score for m. The idea of this strategy is illustrated in
Equation 5:

    w_{ij}^m = P_d = g(R_{ij}^m)   for P_d \in \{P_1, P_2, \ldots, P_n\} and 1 \le j \le n        (5)

where $P_d$ is one of the n pre-determined weights, and $R_{ij}^m$ is the rank
of the score of m by algorithm j under category i. g is a mapping function from
the rank to the assignment of the weight $P_d$ for the document m.
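
A minimal Python sketch of the rank-based mapping g in Equation 5 follows. It assumes, hypothetically, that the highest score receives rank 1 and therefore the first pre-determined weight; the paper does not state the rank direction, the values of the pre-determined weights, or how ties are handled, so these are illustrative choices.

```python
def rank_weights(scores, preset_weights):
    """LC3 sketch: the j-th classifier receives the preset weight for its rank.

    Assumption: the highest score has rank 1 and maps to preset_weights[0].
    Ties are broken by classifier index, an arbitrary illustrative choice.
    """
    if len(scores) != len(preset_weights):
        raise ValueError("one preset weight per rank is required")
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    weights = [0.0] * len(scores)
    for rank, j in enumerate(order):
        weights[j] = preset_weights[rank]
    return weights


# Hypothetical scores of three classifiers for one document and category.
print(rank_weights([0.2, 0.9, 0.5], [0.5, 0.3, 0.2]))
```

Unlike LC1 and LC2, the resulting weights depend on the document m, because the ranking of the scores can change from one document to the next.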

3 Meta-learning Using Document Feature Characteristics 
(MUDOF) 

We propose our second approach of the meta-learning framework for text catego- 
rization, based on multivariate regression analysis, by capturing category specific 
feature characteristics. In MUDOF, there is a separate meta-learning phase us- 
ing document feature characteristics. Document feature characteristics, derived 
from the training set of a particular category, can capture some inherent proper- 
ties of that category. Different from existing categorization methods, instead of 
applying a single method for all categories during classification, this new meta- 
learning approach can automatically recommend a suitable algorithm during 
training and tuning steps, from an algorithm pool, for each category based on 
the category specific statistical characteristics and multivariate regression anal- 
ysis. The problem of predicting the expected classification error of an algorithm 
for a category can be interpreted as a function of feature characteristics. 
In particular, we wish to predict the classification error for a category based
on the feature characteristics. This is achieved by a learning approach based on 
regression model, in which, the document feature characteristics are the inde- 
pendent variables, while the classification error of an algorithm is the dependent 
variable. Feature characteristics are derived from the categories. We further di- 
vide the training collection into two sets, namely the training set and the tuning 
set. Two sets of feature characteristics are collected separately from training set 
and tuning set. Statistics from training set are for parameter estimations. To- 
gether with the estimated parameters, the statistics from tuning set are used for 
predicting the classification error of an algorithm for a category. The algorithm 
with the minimum estimated classification error for a category will be recom- 
mended for that category during the testing, or validation, phase. Classification 
errors need to undergo a logistic transformation to yield the response variable,
or dependent variable, for the meta-model. The transformation ensures that
the fitted error lies in the range from 0 to 1. Consider the ith category and
the jth algorithm. The response variable $y_{ij}$ is related to the feature
characteristics by the regression model shown in Equation 6:

    y_{ij} = \ln\frac{e_{ij}}{1 - e_{ij}} = \beta_j^0 + \sum_{k=1}^{p} \beta_j^k F_i^k + \epsilon_{ij}        (6)

where $e_{ij}$ is the classification error obtained for the ith category by using
the jth algorithm, and $F_i^k$ is the kth feature characteristic of the ith
category. The number of feature characteristics used in the meta-model is p.
$\beta_j^k$ is the parameter estimate for the kth feature using algorithm j. The
error term $\epsilon_{ij}$ is assumed to follow $N(0, \mathrm{var}(\epsilon_{ij}))$.
Based on the regression model above, the outline of the meta-model for text
categorization is given in Figure 1.

Steps 1 to 9 in Figure 1 estimate a set of betas, the parameter estimates of
the feature characteristics in the regression model, for each individual
algorithm. In Step 2, an algorithm with optimized parameter settings is picked
from the algorithm pool. By repeating Steps 3 to 7, the algorithm is applied to
the training and tuning examples to yield the classification errors of the
classifier for all categories. Documents in the tuning set, as shown in Step 5,
are used to obtain the classification performance, and hence the classification
error, of a trained classifier for each category. A set of betas for the
algorithm being considered is obtained by fitting the classification errors of
all categories, and their corresponding feature characteristics in the training
set, into the regression model. After Step 9, there is one set of estimated
betas for each algorithm in the pool; these are used in the subsequent steps.

The predictions of the classification errors of the involved algorithms are
made from Step 10 to Step 16. In Step 12, one algorithm, with the same optimized
parameter settings as in Step 2, is picked from the algorithm pool. The
corresponding set of betas of the selected algorithm, together with the feature
characteristics of a category in the tuning set, are fitted into the regression
model in Step 13 to give the estimated classification error of the algorithm
on the category. The decision about which algorithm will be applied to a
category is based on the minimum predicted classification error in Step 14.
After Step 16, a classification algorithm has been recommended for each
category, and the recommended algorithm is applied to that category during the
validation step. The whole process of our proposed meta-model approach, from
parameter estimation to recommending algorithms for categories, is fully
automatic.

Input: The training set TR and tuning set TU
       An algorithm pool A and category set C

 1) Repeat
 2)   Pick one algorithm ALG_j from A.
 3)   For each category C_i in C
 4)     Apply ALG_j on TR for C_i to yield a classifier CF_ij.
 5)     Apply CF_ij on TU for C_i to yield classification error e_ij.
 6)     Take logistic transformation on e_ij to yield y_ij for later
        parameter estimation.
 7)   End For
 8)   Estimate β_j^k (k = 0, 1, 2, ..., p) for ALG_j by fitting y_ij and
      F_i^k (in TR) into the regression model.
 9) Until no more algorithms in A.
10) For each category C_i in C
11)   Repeat
12)     Pick one algorithm ALG_j from A.
13)     Estimate the classification error ê_ij by fitting β_j^k and the
        corresponding F_i^k (in TU) into the regression model.
14)     If ê_ij is minimum, recommend ALG_j for C_i as the output.
15)   Until no more algorithms in A.
16) End For

Fig. 1. The Meta-Model algorithm.
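
The prediction and recommendation steps (Steps 10 to 16) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the algorithm labels ALG_A and ALG_B, and the beta values are hypothetical, and the regression fit of Step 8 is assumed to have been done elsewhere by any least-squares routine.

```python
import math


def logit(error, eps=1e-6):
    """Logistic transformation of a classification error (response variable y)."""
    e = min(max(error, eps), 1.0 - eps)  # keep the logarithm finite
    return math.log(e / (1.0 - e))


def inverse_logit(y):
    """Map a fitted response back to an error strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))


def predict_error(betas, features):
    """betas[0] is the intercept; betas[1:] pair with the feature values."""
    y = betas[0] + sum(b * f for b, f in zip(betas[1:], features))
    return inverse_logit(y)


def recommend(betas_by_algorithm, features):
    """Return the algorithm with the smallest predicted error for a category."""
    return min(
        betas_by_algorithm,
        key=lambda name: predict_error(betas_by_algorithm[name], features),
    )


# Hypothetical betas for two algorithms and one normalized feature value.
betas = {"ALG_A": [0.0, -1.0], "ALG_B": [0.0, 1.0]}
print(recommend(betas, [2.0]))
```

Because the prediction is passed through the inverse of the logistic transformation, the estimated error always falls between 0 and 1, which is the property the transformation is meant to guarantee.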

4 Experiments and Empirical Results 

4.1 Document Collection and Experimental Setup 

Extensive experiments have been conducted on the Reuters-21578 corpus, which
contains news articles from Reuters in 1987. 90 categories are used in our
experiments. We divided the 21,578 documents in the Reuters-21578 document
collection according to the "ModApte" split into a training collection of 9,603
documents and a testing collection of 3,299 documents. The remaining 8,676
documents are not used in the experiments because they were not classified
by a human indexer. For those meta-models requiring a tuning set, we further
divided the training collection into a training set of 6,000 documents and a
tuning set of 3,603 documents. For each category, we used the training document
collection to learn a classification scheme. The testing collection is used for
evaluating the classification performance.

Six component classification algorithms have been used in our meta-model
approaches: Rocchio, WH, k-NN, SVM, GIS-R and GIS-W, each with optimized
parameter settings. These are six recent algorithms, each of which has a
distinctive nature: Rocchio and WH are linear classifiers, k-NN is an
instance-based learning algorithm, SVM is based on the Structural Risk
Minimization principle, and both GIS-R and GIS-W are based on the generalized
instance approach.

In MUDOF, seven feature characteristics are used in our regression model as 
independent variables: 

1. PosTr : The number of positive training examples of a category. 

2. PosTu: The number of positive tuning examples of a category. 

3. AvgDocLen: The average document length of a category. Document length refers 
to the number of indexed terms within a document. The average is taken across 
all the positive examples of a category. 

4. AvgTermVal: The average term weight of documents across a category. The
average term weight is taken for individual documents first. Then, the average
is taken across all the positive examples of a category.

5. AvgMaxTermVal: The average maximum term weight of documents across a
category. The maximum term weights of individual documents are summed, and the
average is taken across all the positive examples of a category.

6. AvgMinTermVal: The average minimum term weight of documents across a
category. The minimum term weights of individual documents are summed, and the
average is taken across all the positive examples of a category.

7. AvgTermThre: The average number of terms above a term weight threshold. The 
term weight threshold is optimized and set globally. Based on the preset threshold, 
the number of terms with term weight above the threshold within a category are 
summed. The average is then taken across all the positive examples of the category. 

Two sets of normalized feature characteristics are collected separately from
the training set and the tuning set. As illustrated in Step 8 and Step 13 in
Figure 1, the feature characteristics from these two data sets serve different
purposes in the meta-model: feature characteristics from the training set are
used for parameter estimation, while feature characteristics from the tuning
set are used for predicting classification errors, based on which algorithms
are recommended.

To measure the performance, we use both the micro-averaged recall and precision
break-even point measure (MBE) and the macro-averaged recall and precision
break-even point measure (ABE). In the micro-averaged measure, the total
numbers of false positives, false negatives, true positives, and true negatives
are computed across all categories. These totals are used to compute the
micro-recall and micro-precision, and interpolation is then used to find the
break-even point. In the macro-averaged measure, the break-even point for each
individual category is calculated first, and the simple average of those
break-even points is taken across all categories to obtain the final score.
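
The difference between the two averaging schemes can be sketched in Python. The function names and the per-category counts below are hypothetical, and the search for the break-even point itself (by varying the decision threshold and interpolating) is not shown.

```python
def micro_precision_recall(counts):
    """counts: one (true_positive, false_positive, false_negative) per category.

    The counts are pooled across categories before precision and recall
    are computed, so large categories dominate the result.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_average(break_even_points):
    """Simple mean of per-category break-even points; categories weigh equally."""
    return sum(break_even_points) / len(break_even_points)


# Hypothetical counts for one large and one small category.
print(micro_precision_recall([(90, 10, 10), (5, 5, 15)]))
print(macro_average([0.9, 0.5]))
```

In this example the large category dominates the micro-averaged figures, while the macro average gives the small category the same influence as the large one.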

4.2 Empirical Results

After conducting extensive experiments on each component algorithm to search
for its best parameter settings, we conducted further experiments on the linear
combination approach and MUDOF.

Table 1 shows the micro-recall and precision break-even point measure of
our proposed meta-learning models. Table 2 shows the classification performance
improvement, based on the micro-recall and precision break-even point measure,
obtained by the meta-learning models under the linear combination and MUDOF
approaches. The meta-learning models improve over all component algorithms to
varying extents. The improvement over Rocchio is the largest for all
approaches. In particular, the LC2 strategy obtains the largest improvement
among the strategies under the linear combination approach.



Table 1. Micro-recall and precision break-even point measures over 90 categories
for the meta-learning models.

       LC1     LC2     LC3     MUDOF
MBE    0.860   0.862   0.858   0.858



Table 2. Classification improvement by meta-model approaches over component
algorithms based on micro-recall and precision break-even point measures over
90 categories.

ALG    MBE     LC1 +(%)   LC2 +(%)   LC3 +(%)   MUDOF +(%)
RO     0.776   10.825     11.082     10.567     10.567
WH     0.820    4.878      5.122      4.634      4.634
KNN    0.802    7.232      7.481      6.983      6.983
SVM    0.841    2.259      2.497      2.021      2.021
GISR   0.830    3.614      3.855      3.373      3.373
GISW   0.845    1.775      2.012      1.538      1.538
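
The percentages in Table 2 appear consistent with the relative improvement of a meta-model's MBE (Table 1) over the MBE of a component algorithm. The short Python check below illustrates this reading; the function name is hypothetical and the formula is an inference from the tabulated numbers rather than a definition given in the paper.

```python
def improvement_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Rocchio (MBE 0.776) against LC1 (MBE 0.860) and LC2 (MBE 0.862).
print(round(improvement_percent(0.776, 0.860), 3))
print(round(improvement_percent(0.776, 0.862), 3))
```

Under this reading, a component algorithm with a lower MBE shows a larger percentage gain, which is why the improvement over Rocchio is the largest.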



Table 3 shows the parameter estimates for the document feature characteristics
of the different component algorithms in the MUDOF approach. Based on these
parameter estimates and the corresponding feature characteristics, the
estimated classification errors of the different algorithms can be obtained for
each category. Note that a negative parameter estimate contributes to a smaller
estimated classification error for an algorithm on a category. As a result, a
feature characteristic with a large negative parameter estimate is a more
distinctive feature in voting for the algorithm than the others. For example,
as shown in the table, PosTr has a negative estimate, and hence a favourable
impact, for Rocchio, k-NN and SVM.

Table 3. Parameter estimates for document feature characteristics of different
algorithms.

Features         RO         WH         KNN        SVM        GISR       GISW
PosTr            -4.75      9.24       -1.21      -0.46      5.98       9.84
PosTu            -0.82      -17.05     -5.88      -8.68      -17.26     -20.87
AvgDocLen        2028.77    3199.54    1514.31    2275.34    2567.84    2676.94
AvgTermVal       103.51     154.21     81.74      117.62     151.54     155.67
AvgMaxTermVal    14.04      23.21      12.37      15.59      16.95      21.39
AvgMinTermVal    -69.79     -104.37    -68.77     -92.12     -124.89    -138.73
AvgTermThre      -2006.50   -3164.39   -1495.49   -2250.07   -2534.19   -2640.33



Table 4 shows the performance of all component algorithms and our proposed
meta-learning models, based on the macro-recall and precision break-even
point measure over the ten most frequent categories, that is, the ten
categories with the largest numbers of positive training documents. It shows
that, while the linear combination approach outperforms Rocchio, WH, k-NN and
SVM, MUDOF performs better than all the component algorithms. This observation
can be attributed to the fact that, under the MUDOF approach, feature
characteristics derived from categories with a larger number of positive
training examples have better predictive power for the classification errors of
individual component algorithms.



Table 4. Macro-recall and precision break-even point measures of the 10 most
frequent categories.

             RO      WH      KNN     SVM     GISR    GISW    LC1     LC2     LC3     MUDOF
Top 10 ABE   0.730   0.851   0.781   0.859   0.814   0.871   0.867   0.865   0.868   0.874



Besides comparing the performance of the MUDOF approach with individual
component algorithms, we set up the ideal combination of algorithms as another
benchmark for our MUDOF approach. The ideal combination is set up manually and
is composed of the best algorithm for each category, that is, the algorithm
that MUDOF should ideally recommend for that category. Table 5 lists the
selected algorithms and their performance for each of the ten most frequent
categories under MUDOF and the ideal combination (IDEAL). The meta-model
selects the ideal algorithm for 6 of the 10 most frequent categories (the rows
where the MUDOF and IDEAL selections agree). For the remaining 4 categories,
our meta-model selects the second best algorithm. Our results, not shown in the
table, show that the meta-model identifies the ideal algorithm for 60 of the 90
categories.

Table 5. Macro-recall and precision break-even point measures of the 10 most
frequent categories, for individual classifiers, meta-model approach (MUDOF)
and the ideal combination (IDEAL).

Category     RO      WH      KNN     SVM     GISR    GISW    MUDOF         IDEAL
acq          0.829   0.870   0.859   0.931   0.932   0.909   0.932 GISR    0.932 GISR
corn         0.614   0.867   0.690   0.832   0.867   0.885   0.885 GISW    0.885 GISW
crude        0.793   0.853   0.823   0.871   0.813   0.869   0.869 GISW    0.871 SVM
earn         0.956   0.969   0.956   0.980   0.959   0.962   0.980 SVM     0.980 SVM
grain        0.803   0.887   0.820   0.917   0.804   0.910   0.910 GISW    0.917 SVM
interest     0.481   0.881   0.721   0.962   0.721   0.881   0.881 GISW    0.962 SVM
money-fx     0.582   0.718   0.674   0.717   0.681   0.756   0.756 GISW    0.756 GISW
ship         0.800   0.860   0.800   0.845   0.825   0.872   0.860 WH      0.872 GISW
trade        0.732   0.763   0.740   0.715   0.714   0.788   0.788 GISW    0.788 GISW
wheat        0.713   0.839   0.727   0.820   0.825   0.875   0.875 GISW    0.875 GISW
Top 10 ABE   0.730   0.851   0.781   0.859   0.814   0.871   0.874         0.884

Since the ideal combination consists of the most appropriate algorithm for
each category, it sets an upper bound on the improvement that can be made under
our meta-model. Table 6 compares the performance of MUDOF and the ideal
combination (IDEAL) under different measures. Based on the utility measures
shown in the table, our results, not shown in this paper due to space limits,
show that both MUDOF and the ideal combination achieve more than 10%
improvement over Rocchio under both measures. The improvement made by the
meta-model over k-NN and GIS-R is more than 5% and 3%, respectively, under both
measures. When considering the improvement bound set by the ideal combination,
our approach attains more than 90% of the improvement bound for both Rocchio
and k-NN under the Top 10 ABE measure.
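
One plausible reading of "percentage of the improvement bound attained" is the ratio of the gain achieved by MUDOF to the gain achievable by the ideal combination. The Python sketch below checks this reading against the Top 10 ABE figures reported in the tables; the function name is hypothetical and the formula is an assumption, since the paper does not spell it out.

```python
def fraction_of_bound(component, mudof, ideal):
    """Share of the ideal improvement over a component that MUDOF attains."""
    return (mudof - component) / (ideal - component)


# Top 10 ABE values: Rocchio 0.730, k-NN 0.781, MUDOF 0.874, IDEAL 0.884.
print(round(fraction_of_bound(0.730, 0.874, 0.884), 3))
print(round(fraction_of_bound(0.781, 0.874, 0.884), 3))
```

Under this assumption both ratios exceed 0.9, in agreement with the claim of more than 90% for Rocchio and k-NN.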



Table 6. Classification performances of meta-model (MUDOF) and the ideal
combination (IDEAL) under different groups of categories.

Utility Measure   MUDOF   IDEAL
Top 10 ABE        0.874   0.884
All 90 MBE        0.858   0.868



Table 7 shows that incremental improvement can be obtained as more classifiers,
and more robust classifiers, are included in the algorithm pool of the
meta-model. Adding GIS-R to Combination 1 increases performance (Combination
2). Replacing GIS-R with GIS-W (Combination 3) gives a larger improvement over
Combination 1. Adding the robust SVM to Combination 3 increases the MBE
further, as indicated by Combination 4. Combination 5 is the whole algorithm
pool of our meta-model. The results demonstrate that our meta-model under the
MUDOF approach is not limited to combining a fixed number of classifiers, or to
combining classifiers of the same type; instead, it allows flexible additions
or substitutions of different classifiers in its algorithm pool.

Table 7. Performance with different combinations of classifiers based on our
meta-model under micro-recall and precision break-even point measure over 90
categories.

Algorithm Combination                MBE
1) KNN+WH+RO                         0.820
2) GIS-R+KNN+WH+RO                   0.842
3) GIS-W+KNN+WH+RO                   0.848
4) SVM+GIS-W+KNN+WH+RO               0.854
5) GIS-R+SVM+GIS-W+KNN+WH+RO         0.858



5 Conclusions 

We have investigated two meta-model approaches for the task of automatic tex- 
tual document categorization. The first approach is the linear combination ap- 
proach. Under the approach, we propose three different strategies to combine the 
contributions of component algorithms. We have also proposed a second meta- 
model approach, called Meta-learning Using Document Feature characteristics 
(MUDOF), which employs a meta-learning phase using document feature charac- 
teristics. Different from existing categorization methods, MUDOF can automat- 
ically recommend a suitable algorithm for each category based on the category- 
specific statistical characteristics. Moreover, MUDOF allows flexible additions 
or replacement of different classification algorithms, resulting in the improved 
overall classification performance. Extensive experiments have been conducted 
on the Reuters-21578 corpus for both approaches. Satisfactory performance is 
obtained for the meta-learning approaches. 


Abstract. The vocabulary problem in information retrieval arises
because authors and indexers often use different terms for the same 
concept. A thesaurus defines mappings between different but related 
terms. It is widely used in modern information retrieval systems to 
solve the vocabulary problem. Chen et al. proposed the concept space 
approach to automatic thesaurus construction. A concept space contains 
the associations between every pair of terms. Previous research studies 
show that concept space is a useful tool for helping information searchers 
in revising their queries in order to get better results from information 
retrieval systems. The construction of a concept space, however, is 
very computationally intensive. In this paper, we propose and evaluate 
efficient algorithms for constructing concept spaces that include only 
strong associations. Since weak associations are not useful in thesaurus
construction, our algorithms use various pruning techniques to avoid
computing weak associations and thereby achieve efficiency.

Keywords: concept space, thesaurus, information retrieval, text mining 



1 Introduction 

The vocabulary problem has been studied for many years. It refers to the
failure of a system caused by the variety of terms used by its users during
human-system communication. Furnas et al. studied the tendency of different
users to use different terms to describe a similar concept. For example, they
discovered that for spontaneous word choice for concepts in certain domains,
the probability that two people choose the same term is less than 20% [5]. In
an information retrieval system, if the keywords that a user specifies in his
query are not used by the indexer, the retrieval fails. To solve the vocabulary
problem, a thesaurus is often used. A thesaurus contains a list of terms along
with the relationships between them. During searching, a user can make use of
the thesaurus to design the most appropriate search strategy. For example, if a
search retrieves too few documents, a user can expand his query by consulting
the thesaurus for similar terms. On the other hand, if a search retrieves too
many documents, a user can use a more specific term suggested by the thesaurus.
Manual construction of thesauri is a very complex process and often involves
human experts. Previous research has addressed automatic thesaurus
construction.

Chen et al. proposed the concept space approach to automatic thesaurus
generation. A concept space is a network of terms and their weighted
associations. The association between two terms is a quantity between 0 and 1,
computed from the co-occurrence of the terms in a given document collection.
Its value represents the strength of similarity between the terms. When the
association between two terms is zero, the terms have no similarity. This is
because the terms never co-occur in a document. When the association from a
term a to another term b is close to 1, term a is highly related to term b in
the document collection. Based on the idea of concept space, Schatz et al.
constructed a prototype system to provide interactive term suggestion to
searchers of the University of Illinois Digital Library Initiative test-bed
[7]. Given a term, the system retrieves all the terms from a concept space that
have non-zero associations to the given term. The associated terms are
presented to the user in a list, sorted in decreasing order of association
value. The user then selects new terms from the list to refine his queries
interactively. Schatz showed that users could make use of the terms suggested
by the concept space to improve the recall of their queries.
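
The term suggestion step described above can be sketched in Python as follows. The nested-dictionary layout of the concept space, the function name, and the example terms and association values are all hypothetical; the sketch only illustrates retrieving the non-zero associations of a term in decreasing order.

```python
def suggest_terms(concept_space, term, limit=None):
    """Return (term, association) pairs with non-zero association, strongest first.

    concept_space maps a term to a dictionary of {other_term: association}.
    Ties are broken alphabetically, an arbitrary illustrative choice.
    """
    associations = concept_space.get(term, {})
    ranked = sorted(
        ((other, w) for other, w in associations.items() if w > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked if limit is None else ranked[:limit]


# Hypothetical concept space with one source term.
space = {"database": {"sql": 0.8, "index": 0.35, "poetry": 0.0}}
print(suggest_terms(space, "database"))
```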

The construction of a concept space involves two phases: (1) an automatic
indexing phase in which a document collection is processed to build inverted
lists, and (2) a co-occurrence analysis phase in which the associations of
every term pair are computed. Since there could be tens of thousands of terms
in a document collection, computing the associations of all term pairs is very 
time consuming. In order to apply the concept space approach in large-scale 
document collections, efficient methods are needed. 

We observe that in many applications, a complete concept space is not 
needed. In typical document collections, most of the associations are zero, i.e., 
most term pairs are not associated at all. Also, among the non-zero associations, 
only a very small fraction of them have significant values. In typical applications, 
small-valued associations (or weak associations) are not useful. For example, in 
query augmentation, recommending weakly-associated terms to those keywords 
specified by a user query lowers the precision of the retrieval result. 

In this paper, we propose and evaluate a number of efficient algorithms for 
constructing concept spaces that only contain strong associations. The challenge 
is how one could deduce that a certain term pair association is weak without ac- 
tually computing it. We consider a number of pruning techniques that efficiently 
and effectively make such deductions. 

The rest of the paper is organized as follows. In Section 2 we give a formal
definition of concept space and discuss how term-pair associations are
calculated. In Section 3 we consider three algorithms for the efficient
construction of concept spaces that only contain strong associations.
Experimental results comparing the performance of the algorithms are shown in
Section 4. Finally, Section 5 concludes the paper.

2 Concept Space Construction

A concept space contains the associations, $W_{jk}$ and $W_{kj}$, between any
two terms j and k found in a document collection. Note that the associations
are asymmetric, that is, $W_{jk} \ne W_{kj}$ in general. According to Chen and
Lynch, $W_{jk}$ is computed by the following formula:

    W_{jk} = \frac{\sum_{i=1}^{N} d_{ijk}}{\sum_{i=1}^{N} d_{ij}} \times WeightingFactor(k).        (1)

The symbol $d_{ij}$ represents the weight of term j in document i based on the
term-frequency-inverse-document-frequency measure [2]:

    d_{ij} = tf_{ij} \times \log\left(\frac{N}{df_j} \times w_j\right)

where

    tf_{ij} = number of occurrences of term j in document i,
    df_j = number of documents in which term j occurs,
    w_j = number of words in term j,
    N = number of documents.

The symbol $d_{ijk}$ represents the combined weight of both terms j and k in
document i. It is defined as:

    d_{ijk} = tf_{ijk} \times \log\left(\frac{N}{df_{jk}} \times w_j\right)        (2)

where

    tf_{ijk} = number of occurrences of both terms j and k in document i,
               i.e., \min(tf_{ij}, tf_{ik}),
    df_{jk} = number of documents in which both terms j and k occur.

Finally, WeightingFactor(k) is defined as:

    WeightingFactor(k) = \frac{\log(N / df_k)}{\log N}

WeightingFactor(k) is used as a weighting scheme (similar to the concept
of inverse document frequency) to penalize general terms (terms that appear in
many documents). Terms with a high $df_k$ value have a small weighting factor,
which results in a small association value. Chen showed that this asymmetric
similarity function ($W_{jk}$) gives a better association than the popular
cosine function.

In the following discussion, for simplicity, we assume that $w_j = 1$ for all
j (i.e., all terms are single-word terms). We thus remove the term $w_j$ from
the formulas for $d_{ij}$ and $d_{ijk}$.
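
To see how WeightingFactor(k) penalizes general terms, consider the following minimal Python sketch, written for the simplified case $w_j = 1$. The function names and the collection size are hypothetical, and the natural logarithm is an assumption: the paper does not state the base, and the weighting factor, being a ratio of two logarithms, does not depend on it.

```python
import math


def term_weight(tf, df, n_docs):
    """d_ij for a single-word term: term frequency times log(N / df_j)."""
    return tf * math.log(n_docs / df)


def weighting_factor(df_k, n_docs):
    """log(N / df_k) / log(N): close to 1 for rare terms, 0 for ubiquitous ones."""
    return math.log(n_docs / df_k) / math.log(n_docs)


# Hypothetical collection of 1000 documents.
print(weighting_factor(10, 1000))    # rare term
print(weighting_factor(500, 1000))   # general term
```

A term that occurs in half of the documents receives a much smaller factor than one that occurs in only ten, so associations pointing to general terms are scaled down.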

As we have mentioned, concept space construction is a two-phase process.
In the first phase (automatic indexing), a term-document matrix, TF, is
constructed. Given a document i and a term j, the matrix TF returns the term
frequency, $tf_{ij}$. In practice, TF is implemented using inverted lists. That
is, for each term j, a linked list of [document-id, term-frequency] tuples is
maintained. Each tuple records the occurrence frequency of term j in the
document with the corresponding id. Documents that do not contain the term j
are not included in the inverted list of j.

Besides the matrix TF, the automatic indexing phase also calculates the
quantity $df_j$ (the number of documents containing term j) as well as
$\sum_{i=1}^{N} tf_{ij}$ (the sum of the term frequency of term j over the whole
document collection) for each term j. These numbers are stored in arrays for
fast retrieval during the second phase (co-occurrence analysis).

In the co-occurrence analysis phase, the associations of every term pair are
calculated. According to Equation 1, to compute $W_{jk}$, we need to compute the
values of three factors, namely, $\sum_{i=1}^{N} d_{ijk}$,
$\sum_{i=1}^{N} d_{ij}$, and WeightingFactor(k). Note that

    \sum_{i=1}^{N} d_{ij} = \sum_{i=1}^{N} \left[ tf_{ij} \times \log\left(\frac{N}{df_j}\right) \right] = \log\left(\frac{N}{df_j}\right) \times \sum_{i=1}^{N} tf_{ij}.

Since both $df_j$ and $\sum_{i=1}^{N} tf_{ij}$ are already computed and stored
during the automatic indexing phase, $\sum_{i=1}^{N} d_{ij}$ can be computed in
constant time. Similarly, WeightingFactor(k) can be computed in constant time
as well.

Computing $\sum_{i=1}^{N} d_{ijk}$, however, requires much more work. From
Equation 2, one needs to compute $df_{jk}$ (i.e., the number of documents
containing both terms j and k) and $\sum_{i=1}^{N} tf_{ijk}$ in order to find
$\sum_{i=1}^{N} d_{ijk}$. Figure 1 shows an algorithm for computing $W_{jk}$.

The execution time of Weight is dominated by the for-loop in line 3. Basically, 
most of the work is spent on scanning the inverted lists of terms j and k. 

2.1 Algorithm A 

As we have mentioned, most of the associations have zero or very small values. 
Our goal is to construct a concept space that contains only strong associations. 
Given a user-specified threshold $\lambda$, an association $W_{jk}$ is strong
if $W_{jk} > \lambda$; otherwise, the association is weak.

To construct a concept space with only strong associations, our base algorithm
first identifies all term pairs that have non-zero associations and then
applies the function Weight on each pair. In particular, during the automatic
indexing phase, a two-dimensional triangular bit matrix C is built. The entry
C(j, k) is set to 1 if there exists a document that contains both terms j and k
(i.e., $W_{jk} > 0$); otherwise, C(j, k) is set to 0.

During the co-occurrence analysis phase, the matrix C is consulted. The
associations $W_{jk}$ and $W_{kj}$ are computed only if C(j, k) is set.
Associations that are less than $\lambda$ are discarded. We call this base
algorithm Algorithm A (Figure 2).
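The two phases of the base algorithm can be sketched as follows. This is a minimal illustration, not the authors' implementation: documents are assumed to be lists of already stemmed terms, a Python set of ordered pairs stands in for the triangular bit matrix C, and the names `build_index`, `weight`, and `algorithm_a` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_index(docs):
    """Automatic indexing phase: inverted lists plus the co-occurrence set C."""
    inverted = defaultdict(dict)          # term -> {document id: term frequency}
    cooccur = set()                       # stands in for the bit matrix C
    for i, words in enumerate(docs):
        for w in words:
            inverted[w][i] = inverted[w].get(i, 0) + 1
        for j, k in combinations(sorted(set(words)), 2):
            cooccur.add((j, k))           # j < k and both occur in document i
    return inverted, cooccur


def weight(inverted, j, k, n_docs):
    """Exact association W_jk, following the Weight function."""
    list_j, list_k = inverted[j], inverted[k]
    shared = [d for d in list_j if d in list_k]
    sum_d_ij = sum(list_j.values()) * math.log(n_docs / len(list_j))
    if not shared or sum_d_ij == 0:
        return 0.0                        # guard added for this sketch
    sum_tf_ijk = sum(min(list_j[d], list_k[d]) for d in shared)
    sum_d_ijk = sum_tf_ijk * math.log(n_docs / len(shared))
    weighting_factor_k = math.log(n_docs / len(list_k)) / math.log(n_docs)
    return sum_d_ijk * weighting_factor_k / sum_d_ij


def algorithm_a(docs, threshold):
    """Co-occurrence analysis phase: compute every non-zero pair, then filter."""
    inverted, cooccur = build_index(docs)
    strong = {}
    for j, k in cooccur:
        for a, b in ((j, k), (k, j)):     # associations are asymmetric
            w = weight(inverted, a, b, len(docs))
            if w > threshold:
                strong[(a, b)] = w
    return strong


docs = [["a", "b"], ["a", "b"], ["a"], ["c"], ["b", "c"]]
strong_pairs = algorithm_a(docs, 0.5)
```

Every pair flagged in C triggers two full Weight computations, regardless of the threshold.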
Weight(j, k)

 1  df_jk ← 0
 2  sum_tf_ijk ← 0
 3  for each (i, tf_ij) in the adjacency list of j
 4      do
 5          if there exists (i, tf_ik) in the adjacency list of k
 6              then
 7                  df_jk ← df_jk + 1
 8                  if tf_ij < tf_ik
 9                      then
10                          sum_tf_ijk ← sum_tf_ijk + tf_ij
11                      else
12                          sum_tf_ijk ← sum_tf_ijk + tf_ik
13  sum_d_ijk ← sum_tf_ijk × log(N/df_jk)
14  sum_d_ij ← sum_tf_ij × log(N/df_j)
15  weighting_factor_k ← log(N/df_k) / log N
16  return sum_d_ijk × weighting_factor_k / sum_d_ij

Fig. 1. Function Weight



Algorithm-A(λ)

(* Automatic indexing phase *)
C ← 0
for each document i do
    for each term j in document i do
        tf_ij ← no. of occurrences of j in i
        append (i, tf_ij) to j's inverted list
        sum_tf_ij ← sum_tf_ij + tf_ij ; df_j ← df_j + 1
    for each term pair j and k (j < k) in document i do
        C(j, k) ← 1    (* j, k have non-zero associations *)
(* Co-occurrence analysis phase *)
for each C(j, k) = 1 do
    if Weight(j, k) (or Weight(k, j)) > λ
        then output Weight(j, k) (or Weight(k, j))

Fig. 2. Algorithm A
3 Pruning Algorithms

Algorithm A is not particularly efficient. It basically computes all possible
non-zero associations before filtering out those that are weak. As an example,
we ran Algorithm A on a collection of 23,000 documents. It took 100 minutes for
the algorithm to terminate. The major source of inefficiency lies in the Weight
function, which scans two inverted lists for every non-zero association. In the
collection, there are about 9 million non-zero associations, and hence
Algorithm A had to scan about 18 million inverted lists. In this section, we
consider a few algorithms for improving the efficiency of Algorithm A. All of
these algorithms share the following feature. Before computing an association
$W_{jk}$, each one first computes an easy-to-compute estimate of $W_{jk}$. If
the estimate suggests that $W_{jk}$ is likely to be strong, the algorithm will
execute the more expensive Weight function. The algorithms gain efficiency by
avoiding the computation of weak associations.



3.1 Algorithm B 



Our first efficient algorithm, B, computes an estimate $W'_{jk}$ that is always
an upper bound of $W_{jk}$. If $W'_{jk}$ is smaller than the threshold
$\lambda$, we have $W_{jk} \le W'_{jk} < \lambda$. Hence, the association
$W_{jk}$ must be weak and need not be computed.

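Recall that

$$W_{jk} = \frac{\sum_{i=1}^{N} d_{ijk}}{\sum_{i=1}^{N} d_{ij}} \times \text{WeightingFactor}(k)$$

and

$$\sum_{i=1}^{N} d_{ijk} = \sum_{i=1}^{N} tf_{ijk} \times \log\left(\frac{N}{df_{jk}}\right) = \log\left(\frac{N}{df_{jk}}\right) \times \sum_{i=1}^{N} tf_{ijk}.$$

By definition, $tf_{ijk} = \min(tf_{ij}, tf_{ik})$. Note that

$$\sum_{i=1}^{N} tf_{ijk} = \sum_{i=1}^{N} \min(tf_{ij}, tf_{ik}) \le \min\left(\sum_{i=1}^{N} tf_{ij}, \sum_{i=1}^{N} tf_{ik}\right).$$

Also, unless the terms j and k are negatively correlated, we have

$$\frac{df_{jk}}{N} \ge \frac{df_j}{N} \times \frac{df_k}{N}.$$

That is, the probability that a random document contains both terms j and k
is at least as large as that probability when j and k are independent.

Now, consider $W'_{jk}$, defined as:

$$W'_{jk} = \frac{\min\left(\sum_{i=1}^{N} tf_{ij}, \sum_{i=1}^{N} tf_{ik}\right) \times \log\left(\frac{N^2}{df_j \times df_k}\right)}{\sum_{i=1}^{N} d_{ij}} \times \text{WeightingFactor}(k).$$

We have $W'_{jk} \ge W_{jk}$.¹ Note that all the quantities that are needed to
compute $W'_{jk}$ are made available in the automatic indexing phase. Hence,
$W'_{jk}$ can be computed in constant time. Figure 3 shows the function Weight1
for computing $W'_{jk}$, and Figure 4 shows Algorithm B, which uses Weight1 as a
pruning test.

¹ This condition holds unless terms j and k are negatively correlated.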

Weight1(j, k)

sum_d_ijk ← min(sum_tf_ij, sum_tf_ik) × log((N × N)/(df_j × df_k))
sum_d_ij ← sum_tf_ij × log(N/df_j)
weighting_factor_k ← log(N/df_k) / log N
return sum_d_ijk × weighting_factor_k / sum_d_ij

Fig. 3. Function Weight1



Algorithm-B(λ)

(* Automatic indexing phase *)
(* same as Algorithm A *)
(* Co-occurrence analysis phase *)
for each C(j, k) = 1 do
    if Weight1(j, k) (or Weight1(k, j)) > λ
        then output Weight(j, k) (or Weight(k, j))

Fig. 4. Algorithm B
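As an illustration of the pruning test, the sketch below evaluates the Weight1 estimate from per-term statistics only. The function signature and the toy numbers are assumptions made for this example; in the paper the statistics come from the arrays built during automatic indexing.

```python
import math


def weight1(sum_tf_j, sum_tf_k, df_j, df_k, n_docs):
    """Constant-time estimate W'_jk built only from per-term statistics."""
    sum_d_ijk = min(sum_tf_j, sum_tf_k) * math.log(n_docs * n_docs / (df_j * df_k))
    sum_d_ij = sum_tf_j * math.log(n_docs / df_j)
    weighting_factor_k = math.log(n_docs / df_k) / math.log(n_docs)
    return sum_d_ijk * weighting_factor_k / sum_d_ij


# Toy statistics: N = 5, each term occurs once in each of 3 documents.
estimate = weight1(3, 3, 3, 3, 5)
# If the two terms share 2 documents (2/5 >= 3/5 * 3/5, so they are not
# negatively correlated), the exact association is 2*ln(2.5) / (3*ln(5)).
exact = 2 * math.log(2.5) / (3 * math.log(5))
skip_exact_weight = estimate <= 0.7       # with threshold 0.7 the pair is pruned
```

In this toy case the estimate is about 0.63 and the exact value about 0.38, so with a threshold of 0.7 the pair is discarded without scanning any inverted list.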



3.2 Algorithm C 

With $W'_{jk}$, we replace the term $N/df_{jk}$ by the bound
$N^2/(df_j \cdot df_k)$. This bound can be very loose. Many weak associations
$W_{jk}$ may have an estimate $W'_{jk}$ that exceeds $\lambda$, and hence the
expensive Weight function is called. To improve the effectiveness of pruning,
we consider another association estimate.

We first consider how the quantity $df_{jk}$ can be estimated efficiently. For
each term j, we compute a signature, $S_j$. Each signature is an array of Q
bits (with indices from 0 to Q − 1). Let H() be a hash function such that,
given a document i, H(i) returns an index in [0..Q − 1]. For each document i,
$S_j[H(i)]$ is set to 1 if document i contains term j; otherwise $S_j[H(i)]$
is 0.

Given two terms j and k, we estimate $df_{jk}$ by counting the number of ‘1’
bits in the result of applying the bit-wise AND operation on the signatures
$S_j$ and $S_k$. We denote this estimate by $\widehat{df}_{jk}$. Note that the
estimate $\widehat{df}_{jk}$ could be incorrect if Q is too small (which leads
to many hash collisions). However, as we have observed in our experimental
study, setting Q to 5% of the total number of documents in a collection is
sufficient to reduce the error probability to a negligible level. Also, we note
that computing $\widehat{df}_{jk}$ using the small bit-vector signatures is
much faster than computing the exact value of $df_{jk}$ by scanning the
inverted lists.
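The signature scheme can be sketched as follows. The paper does not specify the hash function H, so the modulo hash used here is purely an assumption, as are the function names; each signature is stored as a Python integer used as a bit vector.

```python
def make_signature(doc_ids, q, h=None):
    """Build the Q-bit signature of a term from the ids of its documents."""
    h = h or (lambda i: i % q)    # assumed hash; the paper leaves H unspecified
    sig = 0
    for i in doc_ids:
        sig |= 1 << h(i)          # set bit H(i)
    return sig


def estimate_df_jk(sig_j, sig_k):
    """Count the '1' bits in the bit-wise AND of two signatures."""
    return bin(sig_j & sig_k).count("1")


# Terms j and k truly share documents 1 and 2.
est = estimate_df_jk(make_signature([0, 1, 2, 9], 8), make_signature([1, 2, 3], 8))

# With a small Q, documents 0 and 8 collide, giving a spurious co-occurrence.
collided = estimate_df_jk(make_signature([0], 8), make_signature([8], 8))
```

The second call shows how a small Q can make the estimate differ from the exact value: the two terms share no document, yet the estimate is 1.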

Another estimate we consider is to approximate the value
$\sum_{i=1}^{N} tf_{ijk}$. First, if we use $\max_p(tf_{pj})$ to denote the
largest term frequency of a term j over any document, then clearly,
$tf_{ij} \le \max_p(tf_{pj})$ for any document i. Hence,

$$tf_{ijk} = \min(tf_{ij}, tf_{ik}) \le \min\left(\max_p(tf_{pj}), \max_p(tf_{pk})\right).$$

Thus,

$$\sum_{i=1}^{N} tf_{ijk} = \sum_{j,k \in \text{doc } i} tf_{ijk} \le \sum_{j,k \in \text{doc } i} \min\left(\max_p(tf_{pj}), \max_p(tf_{pk})\right) = df_{jk} \times \min\left(\max_p(tf_{pj}), \max_p(tf_{pk})\right).$$

Substituting these estimates into the association formula, we obtain

$$W''_{jk} = \frac{\min\left(\max_p(tf_{pj}), \max_p(tf_{pk})\right) \times \widehat{df}_{jk} \times \log\left(\frac{N}{\widehat{df}_{jk}}\right)}{\sum_{i=1}^{N} d_{ij}} \times \text{WeightingFactor}(k).$$

Note that, once $\widehat{df}_{jk}$ is known, $W''_{jk}$ can be computed in
constant time.

Figure 5 shows the function Weight2 for computing $W''_{jk}$. Comparing Weight1
and Weight2, we note that Weight1 is a bit more efficient than Weight2 since no
bit vector processing is needed. However, the estimate computed by Weight1 is
less tight. Our algorithm (Algorithm C) thus uses Weight1 as the first pruning
test, then applies Weight2 if necessary. Figure 6 shows Algorithm C.
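The second estimate and the order of the two tests can be sketched as below. The argument list of `weight2` and the helper `needs_exact_weight` are hypothetical names introduced for this example; the zero guard is an addition for the sketch.

```python
import math


def weight2(max_tf_j, max_tf_k, df_jk_est, sum_tf_j, df_j, df_k, n_docs):
    """Estimate W''_jk from the maximum term frequencies and the estimated df_jk."""
    if df_jk_est == 0:
        return 0.0                          # guard added for this sketch
    sum_d_ijk = min(max_tf_j, max_tf_k) * df_jk_est * math.log(n_docs / df_jk_est)
    sum_d_ij = sum_tf_j * math.log(n_docs / df_j)
    weighting_factor_k = math.log(n_docs / df_k) / math.log(n_docs)
    return sum_d_ijk * weighting_factor_k / sum_d_ij


def needs_exact_weight(w1, compute_w2, threshold):
    """Apply the cheap Weight1 result first; call Weight2 only if it passes."""
    return w1 > threshold and compute_w2() > threshold


# Toy statistics: N = 5, df_j = df_k = 3, all term frequencies 1, estimated df_jk = 2.
w2 = weight2(1, 1, 2, 3, 3, 3, 5)
decision = needs_exact_weight(0.63, lambda: w2, 0.5)
```

Passing Weight2 as a callable means the signature processing is skipped whenever the Weight1 test already rejects the pair.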



Weight2(j, k)

est_df_jk ← number of ‘1’ bits in S_j ∧ S_k
sum_d_ijk ← min(max_tf_pj, max_tf_pk) × est_df_jk × log(N/est_df_jk)
sum_d_ij ← sum_tf_ij × log(N/df_j)
weighting_factor_k ← log(N/df_k) / log N
return sum_d_ijk × weighting_factor_k / sum_d_ij

Fig. 5. Function Weight2



3.3 Algorithm D 

In Algorithm C, the quantity $df_{jk}$ is estimated by processing the
signatures $S_j$ and $S_k$. We note that $df_{jk}$ appears only in a log term
(see Equation 3). A mild mis-estimation of $df_{jk}$ should not affect the
value of the association $W_{jk}$ by much. Moreover, by inspecting a few
document collections, we discovered that
Algorithm-C(λ)

(* Automatic indexing phase *)
C ← 0
for each document i do
    for each term j in document i do
        tf_ij ← no. of occurrences of j in i
        append (i, tf_ij) to j's inverted list
        S_j[H(i)] ← 1
        sum_tf_ij ← sum_tf_ij + tf_ij
        max_tf_pj ← max(max_tf_pj, tf_ij) ; df_j ← df_j + 1
    for each term pair j and k (j < k) in document i do
        C(j, k) ← 1    (* j, k have non-zero associations *)
(* Co-occurrence analysis phase *)
for each C(j, k) = 1 do
    if Weight1(j, k) (or Weight1(k, j)) > λ
        then if Weight2(j, k) (or Weight2(k, j)) > λ
            then output Weight(j, k) (or Weight(k, j))

Fig. 6. Algorithm C



for terms that have a not-too-small association (e.g., > 0.6), it is almost
always the case that $df_{jk} = \min(df_j, df_k)$. Our next algorithm uses this
observation to compute a quick estimate, $W'''_{jk}$, for the association
between terms j and k:

$$W'''_{jk} = \frac{\min\left(\sum_{i=1}^{N} tf_{ij}, \sum_{i=1}^{N} tf_{ik}\right) \times \log\left(\frac{N}{\min(df_j, df_k)}\right)}{\sum_{i=1}^{N} d_{ij}} \times \text{WeightingFactor}(k).$$

As in Weight1, we estimate $\sum_{i=1}^{N} tf_{ijk}$ by the upper bound
$\min\left(\sum_{i=1}^{N} tf_{ij}, \sum_{i=1}^{N} tf_{ik}\right)$. Also, we use
$\min(df_j, df_k)$ to approximate $df_{jk}$. Figure 7 shows the function
Weight3 for computing $W'''_{jk}$. Figure 8 shows Algorithm D, which uses
Weight3 as the pruning test.
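The Weight3 estimate can be sketched in the same style as the other tests. The signature and the toy statistics are assumptions for illustration only.

```python
import math


def weight3(sum_tf_j, sum_tf_k, df_j, df_k, n_docs):
    """Constant-time estimate W'''_jk that approximates df_jk by min(df_j, df_k)."""
    sum_d_ijk = min(sum_tf_j, sum_tf_k) * math.log(n_docs / min(df_j, df_k))
    sum_d_ij = sum_tf_j * math.log(n_docs / df_j)
    weighting_factor_k = math.log(n_docs / df_k) / math.log(n_docs)
    return sum_d_ijk * weighting_factor_k / sum_d_ij


# Toy statistics: N = 5, each term occurs once in each of 3 documents.
w3 = weight3(3, 3, 3, 3, 5)
# If the terms actually share only 2 documents, the exact association is:
exact = 2 * math.log(2.5) / (3 * math.log(5))
```

In this toy case the estimate (about 0.32) falls below the exact value (about 0.38), because the true $df_{jk}$ is smaller than $\min(df_j, df_k)$. Unlike the Weight1 estimate, this one is a heuristic rather than an upper bound, so a pruning test based on it can occasionally discard a strong association.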



Weight3(j, k)

sum_d_ijk ← min(sum_tf_ij, sum_tf_ik) × log(N / min(df_j, df_k))
sum_d_ij ← sum_tf_ij × log(N/df_j)
weighting_factor_k ← log(N/df_k) / log N
return sum_d_ijk × weighting_factor_k / sum_d_ij

Fig. 7. Function Weight3

Algorithm-D(λ)

(* Automatic indexing phase *)
(* same as Algorithm A *)
(* Co-occurrence analysis phase *)
for each C(j, k) = 1 do
    if Weight3(j, k) (or Weight3(k, j)) > λ
        then output Weight(j, k) (or Weight(k, j))

Fig. 8. Algorithm D



4 Performance Evaluation 

In this section we compare the performance of the four algorithms. Due to space 
limitation, only some representative results are shown. We applied the algo- 
rithms on a few document collections. For example, we had a medical document 
collection (Medlars) taken from the SMART project Q, and a collection of web 
pages taken from a news web site. For illustration, we show the performance 
results using the collection of news documents. 

The document collection consists of 22,613 documents with 55,772 terms.
The document database is 20.9 MBytes large (after stop-word removal and
stemming). There are about 1.2 × 10^8 non-zero associations in the dataset. The
bar graph in Figure 9 shows the frequency distribution of the non-zero
associations over 10 ranges of association values. This distribution shows that
only a very small fraction of the associations have large values.



[Bar graph not reproduced. x-axis: association value; y-axis: number of term pairs.]

Fig. 9. Distribution of non-zero associations



We ran Algorithms A, B, C, and D on a 250 MHz UltraSparc machine.
Figure 10 shows the runtime of the algorithms. For Algorithm C, the number of
bits for a signature vector is set to 5% of the total number of documents in
the collection (i.e., 2,789 bits).

Fig. 10. Runtime (secs) of algorithms



From the figure, we see that Algorithm A is the slowest, taking about 3,500
seconds to complete. The runtime of Algorithm A is independent of $\lambda$,
since it simply computes all non-zero associations. Algorithms B, C, and D are
much faster than A. For example, when $\lambda = 0.9$, Algorithm D takes only
380 seconds, a speedup of 9.2 times over Algorithm A.

Among the three pruning algorithms, Algorithm D is the most efficient,
followed by C, and then by B. Moreover, their speed increases with $\lambda$.
This is because a larger $\lambda$ value means that more associations are
considered weak. This allows the algorithms to prune more weak associations.

It is interesting to see how much unnecessary work each algorithm has done.
Figure 11 shows the number of weak associations that each algorithm has
computed (using the Weight function). The figure shows, for example, that when
$\lambda = 0.9$, Algorithm B computed about 20 million associations whose
values are less than 0.9. Algorithm A, on the other hand, computed about 110
million such associations, or 5.5 times as many as Algorithm B. Algorithms C
and D are even more effective in pruning weak associations. For example, at
$\lambda = 0.9$, Algorithm D computed only 349,638 weak associations, or about
0.3% of the number computed by Algorithm A. Finally, we note that even though
Algorithms C and D are equally effective in pruning weak associations,
Algorithm C is less efficient than Algorithm D. This is because Algorithm C
uses signature vectors $S_j$ and $S_k$ to estimate $df_{jk}$. Processing these
signatures requires O(Q) time, where Q is the number of bits in a signature
vector. Computing the function Weight2 thus takes O(Q) time. Algorithm D, on
the other hand, computes the function Weight3 in constant time. Hence,
Algorithm D is faster.

As we have discussed, the efficiency of the pruning algorithms comes from the
use of estimates that act as pruning tests to filter out weak associations
(without computing them). Although very unlikely, it is possible that these
pruning tests fail. That is to say, it is possible (with a very small
probability) that a strong association is mistakenly indicated by a pruning
test to be weak. In such a case, the association is (erroneously) not computed.
We compared the output of Algorithm A with those of the three pruning
algorithms. Fortunately, we discovered that only a very small fraction of the
strong associations are discarded. In particular, Algorithm B never misses any
strong associations; also, for $\lambda > 0.7$, Algorithms C and D missed at
most 8 and 7 strong associations, respectively (out of 1,844,552).

Fig. 11. Number of weak associations computed by the algorithms



5 Conclusion 

This paper studied the problem of concept space construction. Previous studies 
have shown that the concept space approach to automatic thesaurus construc- 
tion is a useful tool for information retrieval. The construction of concept spaces, 
however, is very time-consuming. In many applications, a full concept space is
not needed; in particular, only strong associations are used. We proposed three
pruning algorithms for constructing concept spaces containing only strong
associations. We evaluated these algorithms using a number of document
collections. We found that the three pruning algorithms are very effective in
avoiding the computation of unneeded weak associations. A tenfold speedup of
the construction process can be achieved.
Abstract. We address the problem of Topic Detection and Tracking
(TDT) and subsequently detecting trends from a stream of text 
documents. Formulating TDT as a clustering problem in a class of 
self-organizing neural networks, we propose an incremental clustering 
algorithm. On this setup we show how trends can be identified. Through 
experimental studies, we observe that our method enables discovering 
interesting trends that are deducible only from reading all relevant 
documents. 

Keywords: Topic detection, topic tracking, trend analysis, text mining, 
document clustering 



1 Introduction 

In this paper, we address the problem of analyzing trends from a stream of text 
documents, using an approach based on the Topic Detection and Tracking initia- 
tive. Topic Detection and Tracking (TDT) ^ research is a DARPA-sponsored 
effort that has been pursued since 1997. TDT refers to tasks on analyzing time- 
ordered information sources, e.g news wires. Topic detection is the task of de- 
tecting topics that are previously unknown to the system P]. Topic here is an 
abstraction of a cluster of stories that discuss the same event. Tracking refers to 
associating incoming stories with topics (i.e. respective clusters) known to the 
systemjH|. The topic detection and tracking formalism together with the time 
ordering of the documents provides a nice setup for tracing the evolution of
a topic. In this paper, we show how this setup can be exploited for analyzing 
trends. 

Topic detection, tracking and trend analysis, the three tasks being performed 
on an incoming stream of documents, necessitate solutions based on incremental
algorithms. A class of models that enable incremental solutions are the Adaptive 
Resonance Theory (ART) networks |3, which we shall adopt in this paper. 
2 Document Representation

We adopt the traditional vector space modeljni for representing the documents, 
i.e. each document is represented by a set of keyword features. We employ a 
simple feature selection method whereby all words appearing in less than 5% of 
the collection are removed and, from each document, only the top n number of 
features based on tf.idf ranking are picked. Let M be the number of keyword 
features selected through this process. With these features, each document is 
converted into a keyword weight vector 

$$a = (a_1, a_2, \ldots, a_M) \qquad (1)$$

where aj is the normalized word frequency of the keyword Wj in the keyword 
feature list. The normalization is done by dividing each word frequency with the 
maximum word frequency. 
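The conversion of a document into a keyword weight vector can be sketched as follows. The sketch assumes that the feature list has already been selected and that the maximum is taken over the selected keywords within the document; the function name and the sample tokens are hypothetical.

```python
from collections import Counter


def weight_vector(tokens, features):
    """Map a tokenized document to its normalized keyword weight vector."""
    counts = Counter(tokens)
    freqs = [counts[w] for w in features]          # raw frequency per keyword
    top = max(freqs, default=0)
    if top == 0:
        return [0.0] * len(features)               # no selected keyword occurs
    return [f / top for f in freqs]                # divide by the maximum frequency


vec = weight_vector(["linux", "kernel", "linux", "patch"],
                    ["linux", "kernel", "apple"])
```

Each component lies in [0, 1], which suits an analog-input network such as fuzzy ART.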

We assume that text streams are provided as document collections ordered 
over time. The collections must be disjoint sets but could have been collected over 
unequal time periods. We shall call these time-ordered collections as segments. 



3 ART Networks 

Adaptive Resonance Theory (ART) networks are a class of self-organizing neural 
networks. Of the several varieties of ART networks proposed in the literature, 
we shall adopt the fuzzy ART networks [2].

Fuzzy ART incorporates computations from fuzzy set theory into ART networks.
The crisp (nonfuzzy) intersection operator (∩) that describes ART 1 dynamics is
replaced by the fuzzy AND operator (∧) of fuzzy set theory in the choice,
search, and learning laws of ART 1. By replacing the crisp logical operations
of ART 1 with their fuzzy counterparts, fuzzy ART can learn stable categories
in response to either analog or binary patterns.

Each fuzzy ART system includes a field, F0, of nodes that represents a current
input vector; a field, F1, that receives both bottom-up input from F0 and
top-down input from a field, F2, that represents the active code or category.
The F0 activity vector is denoted I. The F1 activity vector is denoted x. The
F2 activity vector is denoted y.

Due to space constraints, we skip the description of the fuzzy ART learning
algorithm. The interested reader may refer to [2] for details.
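For orientation only, the sketch below follows the commonly published form of fuzzy ART (choice function, vigilance test, and learning rule built on the component-wise minimum as fuzzy AND). It is not taken from this paper, which defers to its reference for the details, and it makes simplifying assumptions: complement coding is omitted, a new category is appended instead of using a pool of uncommitted nodes, and the function names are hypothetical.

```python
def fuzzy_and(x, w):
    """Fuzzy AND: component-wise minimum of two vectors."""
    return [min(a, b) for a, b in zip(x, w)]


def norm1(v):
    """City-block norm of a non-negative vector."""
    return sum(v)


def present(x, weights, alpha=0.1, rho=0.01, beta=1.0):
    """Present input x; return the index of the category that encodes it.

    weights is a list of category weight vectors and is updated in place.
    alpha: choice parameter, rho: vigilance, beta: learning rate.
    """
    if norm1(x) == 0:
        raise ValueError("input vector must be non-zero")
    # Search categories in decreasing order of the choice function.
    order = sorted(
        range(len(weights)),
        key=lambda j: norm1(fuzzy_and(x, weights[j])) / (alpha + norm1(weights[j])),
        reverse=True,
    )
    for j in order:
        matched = fuzzy_and(x, weights[j])
        if norm1(matched) / norm1(x) >= rho:       # vigilance test passed
            weights[j] = [beta * m + (1 - beta) * w
                          for m, w in zip(matched, weights[j])]
            return j
    weights.append(list(x))                        # no match: new category
    return len(weights) - 1


categories = []
winners = [present(v, categories, rho=0.5)
           for v in ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1])]
```

In this toy run the second input fails the vigilance test against the first category and therefore creates a new one, which is the mechanism that the topic detection algorithm interprets as a new topic.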

4 Topic Detection, Tracking, and Trend Analysis 



In this section we present our topic detection, tracking and trend analysis me- 
thods. 
4.1 Topic Detection Algorithm

As described in Section 3, ART formulates recognition categories of input
patterns by encoding each input pattern into a category node in an unsupervised
manner. Thus each category node in the F2 field encodes a cluster of patterns.
In other words, each node represents a topic. Hence, identification of new
topics translates to the creation of new categories in the F2 field as more
patterns are presented. Using this idea, we derive the topic detection
algorithm in Table 1.



Table 1. Topic Detection Algorithm. 



Step 1. Initialize network and parameters.
Step 2. Load previous network and cluster structure, if any.
Step 3. Repeat
          - present the document vectors
          - train the net using the fuzzy ART learning algorithm
        until convergence.
Step 4. Prune the network to remove low-confidence category nodes.
Step 5. Save the net and cluster structure.



4.2 Topic Tracking Algorithm 

For tracking new documents, the latest topic structure is loaded before processing 
the documents. For an incoming document, the activities at the F2 field are 
checked to select the winning node, i.e. the one receiving maximum input. The 
document is then assigned to the corresponding topic. This is the idea behind 
the tracking algorithm presented in Table 2.



Table 2. Topic Tracking Algorithm. 



Step 1. Initialize network and parameters.
Step 2. Load previous network and cluster structure, if any.
Step 3. Present the document to be tracked to the net.
Step 4. Assign the document to the topic corresponding to the winning
        category node, i.e., the category node that receives maximum input.
4.3 Trend Analysis

The topic detection and tracking setup together with the time ordering of the
documents provides a natural way to perform focused, topic-wise trend analysis.
In particular, for every topic, suppose we plot the number of documents per
segment versus time. This plot can be thought of as a trace of the evolution of
a topic. The ‘ups’ and ‘downs’ in the graph can be used to deduce the trends
for this topic. For more specific details on the trends, one can zoom in and
view documents on this topic segment-wise. This process is illustrated in the
following section.
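The per-topic trace described above amounts to counting documents per topic and segment. A minimal sketch, assuming the tracking step yields (segment id, topic) pairs with segment ids starting at 1; the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def topic_trends(assignments, n_segments):
    """Count documents per topic and segment.

    assignments: iterable of (segment_id, topic) pairs, segment ids 1..n_segments.
    Returns {topic: [count in segment 1, ..., count in segment n_segments]}.
    """
    counts = defaultdict(lambda: [0] * n_segments)
    for segment, topic in assignments:
        counts[topic][segment - 1] += 1
    return dict(counts)


def peak_segment(series):
    """Return the 1-based id of the segment with the most documents."""
    return max(range(len(series)), key=series.__getitem__) + 1


trends = topic_trends([(1, "Linux"), (1, "Linux"), (2, "Apple"), (3, "Linux")], 3)
```

Plotting each list against the segment id gives the evolution graph of a topic; `peak_segment` marks where one would zoom in to read the underlying documents.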

5 Experiments 

For our experiments, we collected daily news articles from CNET and ZDNet and
grouped the articles into weekly segments. Starting from the 1st week of
September 2000 up to the 4th week of October 2000, we collected 8 segments in
all. In total there were 1468 documents, an average of about 180 documents in
each segment. Documents in each segment are converted into weight vectors as
described in Section 2. We then applied our topic detection and tracking
algorithms and performed trend analysis. Some qualitative results are presented
below.

5.1 Topic Detection and Tracking 

Typically we observed 10 to 15 new topics being identified per segment when the
choice parameter α = 0.1 and the vigilance parameter ρ = 0.01 (ignoring small
clusters with only 1 or 2 documents).

A list of some of the hot topics that have been identified by the topic detection 
algorithm can be viewed at http://textmining.krdl.org.sg/people/kanagasa/tdt. 
The tracking results can also be viewed at the same URL. We skip the details 
due to space constraints. 

5.2 Trend Analysis 

The evolution graphs for some selected topics are shown below. Time is
represented through the segment ID, which takes values 1, ..., 8. ID=1
corresponds to the 1st week of September, ID=2 to the 2nd week of September,
and so on.

The topics ‘MS Case’ (i.e. Microsoft Case), ‘Linux’ and ‘Windows ME’ have been
plotted in Fig. 1. The ‘MS Case’ topic shows an initial up trend in early
September. An examination of the documents under this topic reveals the reason
to be the Bristol Technology ruling against Microsoft. Similarly, the topic on
‘Linux’ shows a peak in early October, when the Open Source conference was
held. The ‘Windows ME’ graph peaks during the 2nd week of September, coinciding
with the Windows ME release.

The topics ‘Apple’ and ‘Hackings’ have been plotted in Fig. 2. The ‘Apple’
topic shows an up trend during mid September, when Apple Expo was on. The
Microsoft hack-in can be seen to have led to the sudden peak in the ‘Hackings’
topic around late October.
Fig. 1. Trends for ‘MS Case’, ‘Linux’ and ‘Windows ME’.




Fig. 2. Trends for ‘Apple’ and ‘Hackings’. 



The above study thus shows that our method can be used to detect hot 
topics automatically and track the evolution of detected topics. The method 
also serves to spot emerging trends (with respect to the timescale defined) on 
topics of interest. 

6 Related Work 

TDT research has been predominantly ‘pure IR’ based and can be categorized 
as based on either incremental clustering (e.g. 0) or routing-queries (e.g. |n|). 
(One notable exception is the tracking method by Dragon Systems which is based 
on language-modelling techniques.) Incremental clustering based methods come
the closest to our work, but we use ART networks for document processing in 
contrast to the traditional document similarity measures. Our main motivation 
is that ART networks enable truly incremental algorithms. 

Trend analysis for numerical data has been well investigated. For free-text, 
where the challenge is tough, we are aware of only very few papers. 0 defines
concept distributions and proposes a trend analysis method by comparing
distributions of old and new data. Typically, the trends discovered are of the 
type “keyword ‘napster’ appeared x% more now than in old data”, “keyword 
‘divx’ appeared y% less now than in old data”, etc. |3j uses the popular a-priori 
algorithm employed in association-rule learning, for finding interesting
phrases. Trend analysis is done by applying a shape-based query language on the
identified phrases. Queries like ‘Up’ or ‘BigDown’ could be used to identify
upward and strong downward trends respectively, in terms of phrases. However,
there could potentially be a large number of candidate phrases, which could
make this method inefficient.

In contrast, our trend analysis method, being based on topic detection and
tracking, enables finding specific, topic-wise trends. The TDT formulation
offers several advantages. The topic detection and tracking step enables the
trend analysis to be focused and more meaningful. Since the number of documents
under each topic is relatively small, the analysis can be done efficiently. (On
a related note, the ART learning algorithm can be implemented in parallel,
which implies potential further speedup.)

7 Conclusion 

We have addressed the problem of analyzing trends from a stream of text using 
the TDT approach. First we have formulated TDT as a clustering problem in 
ART networks and proposed an incremental clustering algorithm. On this setup 
we have shown how trends can be identified. Through experimental studies, 
we have found our method enables discovering interesting trends that are not 
directly mentioned in the documents but deducible only from reading all relevant 
documents. 
Abstract. In this work we developed a new automatic hypertext construction
method based on a proposed text mining approach. Our method applies the
self-organizing map algorithm to cluster some flat text documents in a training
corpus and generate two maps. We then use these maps to identify the sources
and destinations of some important hyperlinks within these training documents.
The constructed hyperlinks are then inserted into the training documents to
translate them into hypertext form. Such translated documents form the new
corpus. Incoming documents can also be translated into hypertext form and added
to the corpus through the same approach. Our method has been tested on a set of
flat text documents collected from several newswire sites. Although we only
used Chinese text documents, our approach can be applied to any document that
can be transformed to a set of indexed terms.



1 Introduction 

The use of hypertexts for information representation has been widely recognized
and accepted because they provide a feasible mechanism to retrieve related
documents. Unfortunately, most hypertext documents were created manually using
some kind of authoring tool. Although such authoring tools are easy to use and
provide sufficient functionality for individual users, manual construction of
hypertexts is still very hard work, if not impossible. Moreover, manual
construction is always unstructured and unmethodical. To remedy such
inefficiency, we need a method to automatically transform a ‘flat’ text into a
hypertext rather than creating a hypertext from the ground up. Thus research on
automatic hypertext construction has grown rapidly in recent years.

To transform a flat text to a hypertext we need to decide where to insert a
hyperlink in the text. A hyperlink connects two documents: at one end of the
hyperlink is the source text, which may be an individual term or a sentence,
and at the other end is the destination text, which may be another document or
a different location in the same document. Different types of hyperlinks may be
used depending on the kinds of functionality that need to be implemented by the
hypertext. For example, according to Agosti et al. Q, there are three types of
hyperlinks, namely structural links, referential links, and associative links.
The first two types of hyperlinks are usually explicit and may be easily
created manually or automatically. The associative hyperlinks, however, require
an understanding of the semantics of the connecting documents. Such
understanding also requires much human effort. Nowadays, automatic creation of
associative hyperlinks, or semantic hyperlinks, plays a central role in the
development of automatic hypertext construction methods because these
hyperlinks may provide the most effective exploring paths to fulfill the users’
information needs. The critical point in creating associative hyperlinks is to
find the documents which are semantically relevant to the sources of these
hyperlinks. Such semantic relevance could be revealed by a text mining process.

Text mining concerns the discovery of knowledge from a textual database and
attracts much attention from both researchers and practitioners. The problem is 
not easy to tackle due to the semi-structured or unstructured nature of the text 
documents. Many approaches have been devised in recent years (for example, 
0). In this work, we apply the self-organizing map model to perform the text 
mining process. Since the self-organizing process could reveal the relationship 
among documents as well as indexed terms, such text mining process may also 
be used to find the associative hyperlinks. 

2 Related Work 

Research on automatic construction of hypertext arose mostly from the infor- 
mation retrieval field. A survey of the use of information retrieval techniques for 
the automatic hypertext construction can be found in fp. According to that
survey, no neural-network-based method had made a significant contribution in
this field.

Text mining has received a lot of attention in the last few years. Many
researchers and practitioners have been involved in this field using various
approaches p|. Among these approaches, the self-organizing map (SOM) [3] models
played an important role. Many works have used SOM to cluster large collections
of text documents. However, few works have applied text mining approaches,
particularly the SOM approach, to automatic hypertext construction. One closely
related work by Rizzo et al. jS] used the SOM to cluster hypertext documents.
However, their work was used for interactive browsing and document searching,
rather than hypertext authoring.

3 Text Mining by Self-Organizing Maps 

Before we can create hyperlinks, we first perform a text mining process, which is 
similar to the work described in on the corpus. The popular self-organizing 
map (SOM) algorithm is applied to the corpus to cluster documents. We adopt 
the vector space model jO] to transform each document in the corpus into a 
binary vector. These document vectors are used as input to train the map. We 
then apply two kinds of labeling process to the trained map and obtain two 
feature maps, namely the document cluster map (DCM) and the word cluster 
map (WCM). In the document cluster map each neuron represents a document 
cluster which contains several similar documents with high word co-occurrence.
In the word cluster map each neuron represents a cluster of words which reveal 
the general concept of the corresponding document cluster associated to the 
same neuron in the document cluster map. 

Since the DCM and the WCM use the same neuron map, a neuron represents a
document cluster as well as a word cluster simultaneously. By linking the DCM
and the WCM according to the neuron locations we may discover the underlying
ideas of a set of related documents. This is essentially a text mining process
because we can discover related terms among a set of related documents through
their co-occurrence patterns, which are hard to extract directly. By virtue of
the SOM algorithm, terms that often co-occur will tend to be labeled to the
same neuron, or to neighboring neurons, because the neurons tend to learn all
of them simultaneously. Thus the co-occurrence patterns of indexed terms may be
revealed. Moreover, the terms that are associated to the same neuron also
reveal the common themes of the documents associated to that neuron.
Essentially these related terms construct a thesaurus that is derived from the
context of the underlying documents rather than their grammatical meaning. Thus
the indexed terms associated to the same neuron in the WCM compose a
pseudo-document which represents the general concept of the documents
associated to that neuron. Therefore our approach may also generate a thesaurus
automatically, which is another text mining application.

4 Automatic Hypertext Construction 

We divide the hypertext construction process into two parts. In the first part
we are concerned with finding the source of a hyperlink. In the second part we
find the destination of a hyperlink. The entire process is described
in the following subsections.



4.1 Finding Sources

To find the source of a hyperlink within a document, we should first decide which
important terms are worth further exploration. In this work, two kinds of
terms are used as sources. The first kind includes the terms that are
the themes of other documents but not of this document. Such terms are
generally recognized as necessary sources of hyperlinks because they fulfill users’
needs while browsing this document. We call these hyperlinks the inter-cluster
hyperlinks because they often connect documents which are located in different
document clusters in the DCM. The reason for such disparity is that a document
cluster in the DCM contains documents whose common terms often co-occur. That is,
the corresponding word cluster will contain these common terms. Therefore, the
first kind of terms is used in creating inter-cluster hyperlinks.

The second kind includes terms that are the themes of this document.
Terms of this kind are used to include documents that are related to this
document for referential purposes. Such hyperlinks may be created by adding links
between each pair of documents associated with the same document cluster in the
DCM. Since these documents share some common concepts after the text mining
process, we may consider them related and use them to create the intra-cluster
hyperlinks.

In the following we describe how to obtain the sources of these two kinds
of hyperlinks. To create an inter-cluster hyperlink in a document Dj associated with
a word cluster Wc, we may find its source by selecting a term that is associated
with other word clusters but not with Wc. That is, a term ki is selected as a source if:

$$k_i \notin W_c \ \text{and}\ k_i \in W_m \ \text{for}\ m \neq c, \tag{1}$$

where Wm is the set of words associated with neuron m in the WCM and Wc is
the word cluster associated with the document cluster that contains Dj. To find
the sources of the intra-cluster hyperlinks in document Dj, we simply find the
terms in the word cluster Wc which is associated with the document cluster that
contains Dj. That is, we select all ki if:

$$k_i \in W_c. \tag{2}$$
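
The selection rules in Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_sources` and the dictionary layout of the word clusters are hypothetical, and the sketch restricts candidates to terms that actually occur in the document, which the text implies but does not state.

```python
def find_sources(doc_terms, c, word_clusters):
    """Split the terms of one document into inter- and intra-cluster sources.

    doc_terms:     set of indexed terms occurring in document D_j
    c:             index of the neuron whose clusters contain D_j
    word_clusters: dict mapping neuron index m to the word set W_m
    """
    own = word_clusters.get(c, set())
    # Union of all word clusters W_m with m != c.
    other = set()
    for m, words in word_clusters.items():
        if m != c:
            other |= words
    inter = {k for k in doc_terms if k not in own and k in other}  # Eq. (1)
    intra = {k for k in doc_terms if k in own}                     # Eq. (2)
    return inter, intra
```

Terms that belong to no word cluster at all (for example, terms removed by the stoplist) are selected by neither rule and produce no hyperlink.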



4.2 Finding Destinations 

It is straightforward to find the destinations of hyperlinks after determining
the sources as described in Sec. 4.1. For an inter-cluster hyperlink in document
Dj with source ki, we assign a document Di as its destination if it fulfills the
following requirements:

1. Dj and Di belong to different document clusters and $k_i \notin W_c$.

2. $k_i \in W_m$, $m \neq c$ and

$$w_{im} = \max_{l \neq c,\ 1 \le l \le M} w_{il}, \tag{3}$$

where c is the neuron index of the word cluster that contains Dj.

3. The distance between Di and Dj is minimum, i.e.

$$\|d_j - d_i\| = \min_{l} \|d_l - d_j\|. \tag{4}$$

The first requirement states that the destination document should be reasonably
different from the source document and the source document should not have ki
as its theme. The second requirement selects the word cluster that is the most
relevant to ki. Since a word cluster may have several documents associated with
it, we need the third requirement to choose the most similar one.
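
The three requirements can be read as a filter, an arg-max and an arg-min. The sketch below follows that reading; it is not the authors' code. The names (`inter_cluster_destination`, `weights`, `doc_clusters`, `vectors`) are hypothetical, `weights[l][i]` is assumed to hold the weight written $w_{il}$ in Eq. (3), and skipping neurons with an empty document cluster is an added assumption.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inter_cluster_destination(i, j, c, word_clusters, weights, doc_clusters, vectors):
    """Choose the destination document for source term index i in document j.

    c:             neuron index of the cluster holding document j
    word_clusters: dict neuron -> set of term indices (W_m)
    weights:       dict neuron -> weight vector; weights[l][i] plays the role of w_il
    doc_clusters:  dict neuron -> list of document ids
    vectors:       dict document id -> document vector
    """
    # Requirement 1: the source term must not be a theme of the source cluster.
    if i in word_clusters.get(c, set()):
        return None
    # Requirement 2: among other neurons whose word cluster holds the term,
    # take the one with the largest weight for that term (Eq. 3).
    candidates = [m for m, words in word_clusters.items()
                  if m != c and i in words and doc_clusters.get(m)]
    if not candidates:
        return None
    m = max(candidates, key=lambda l: weights[l][i])
    # Requirement 3: the closest document within that cluster (Eq. 4).
    return min(doc_clusters[m], key=lambda d: euclidean(vectors[d], vectors[j]))
```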

To find the destination of an intra-cluster hyperlink starting from $k_i \in W_c$, we
simply connect it to a document associated with neuron c in the DCM because this
document cluster contains the most related documents. Since a document cluster
may contain multiple documents, the document which has minimum distance to
the source document (the document containing the source) will be selected as
the destination of this hyperlink. This is also formulated in Eq. (4), where dj and
di are the source and destination documents, respectively.

5 Experimental Results

The experiments were based on a corpus collected by the authors. The test
corpus contains 3268 news articles which were posted by the CNA (Central News
Agency) from Oct. 1, 1996 to Oct. 10, 1996. Each article contains a subject line
starting with a The test documents were written in Chinese, so we applied a 
Chinese term extraction program to every document in the corpus and obtained 
the indexed terms for each document. The overall vocabulary contains 10937 
terms for these 3268 documents. To reduce the number of terms, we manually 
discarded those terms that occur only once in a document. We also discarded 
those terms that occur in a manually constructed stoplist which contains 259 
Chinese terms. This reduces the number of indexed terms to 1976. 
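
The vocabulary reduction and the binary document vectors can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, and the frequency rule (keep a term if it occurs at least twice in some document) is only one possible reading of the filtering step described above, which the authors performed manually.

```python
from collections import Counter

def build_vocabulary(docs, stoplist, min_count=2):
    """docs is a list of term lists, one per document.

    A term is kept if it occurs at least min_count times in some document
    and is not in the stoplist (an assumed reading of the text).
    """
    keep = set()
    for terms in docs:
        for term, n in Counter(terms).items():
            if n >= min_count and term not in stoplist:
                keep.add(term)
    return sorted(keep)

def to_binary_vector(terms, vocabulary):
    """1 where the vocabulary term occurs in the document, else 0."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]
```

With the corpus described here, the vocabulary would have 1976 entries and every document would become a 1976-dimensional binary vector.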

Each document is transformed to a 1976-dimensional binary vector and fed
into a SOM network for training. The network contains a 20x20 grid of neurons,
each of which is represented by a 1976-dimensional synaptic weight vector. All
documents were used to train the network. We set the maximum training epoch to
500 and the training gain to 0.4. After the training process, we labeled the
neurons with documents and terms respectively and obtained the DCM and WCM.
We started creating hypertext documents after obtaining the DCM and the
WCM. The sources and destinations were determined by the method described
in Sec. 4. The spanning factors $\sigma_1$ and $\sigma_2$ were set to 10 and 5 respectively. For
each document we also generated an aggregate link which contains hyperlinks
to all relevant documents. The flat text documents were then converted to their
corresponding hypertext form by a text conversion program. We adopted the
standard HTML format to represent the hypertexts for easy access via the Internet.
An example flat text document and its hypertext form are shown in Figure 1.
The underlined terms in the figure are the obtained sources, where the italic
ones depict the sources for intra-cluster hyperlinks.
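
For readers unfamiliar with the SOM, a minimal pure-Python training loop is sketched below. Only the grid size, the number of epochs and the initial gain come from the text; the random initialization, the linearly decaying gain, the Gaussian neighbourhood and the best-matching-unit labelling of documents are common textbook choices assumed here, not details confirmed by the paper.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda n: sum((wk - xk) ** 2 for wk, xk in zip(weights[n], x)))

def train_som(vectors, rows, cols, epochs, gain, seed=0):
    """Train a rows x cols map on equal-length vectors; returns the weight vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        alpha = gain * (1.0 - frac)                 # decaying gain (assumed schedule)
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood (assumed)
        for x in vectors:
            br, bc = divmod(best_matching_unit(weights, x), cols)
            for n, w in enumerate(weights):
                r, c = divmod(n, cols)
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2.0 * radius ** 2))
                for k in range(dim):
                    w[k] += alpha * h * (x[k] - w[k])
    return weights

def label_documents(weights, vectors):
    """DCM-style labelling: each document is assigned to its best-matching neuron."""
    dcm = {}
    for doc_id, x in enumerate(vectors):
        dcm.setdefault(best_matching_unit(weights, x), []).append(doc_id)
    return dcm
```

The experiment above would correspond to `train_som(vectors, rows=20, cols=20, epochs=500, gain=0.4)`; at that scale a vectorized implementation would be far faster than this loop.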

In Figure 1 an intra-cluster hyperlink connects the source document to a
randomly selected document in the same document cluster. A hyperlink was created
on every occurrence of each source term. The aggregate links were appended to
the converted hypertexts beneath the “Relevant links:” line. The number of
relevant documents is equal to the number of documents in the document cluster
with which the source document was associated, minus one.
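
The conversion step can be pictured with a small sketch: every occurrence of a source term becomes an anchor, and the aggregate links follow a “Relevant links:” line. The function name, the file names used as link targets and the exact HTML layout are hypothetical; the authors' conversion program is not described in detail.

```python
import html
import re

def to_hypertext(text, links, relevant):
    """Convert flat text to a simple HTML page.

    links:    dict mapping a source term to its destination URL
    relevant: list of URLs for the aggregate "Relevant links:" section
    """
    body = html.escape(text)
    if links:
        escaped = {html.escape(term): html.escape(url) for term, url in links.items()}
        # Longest terms first, so a term is not pre-empted by one of its substrings.
        ordered = sorted(escaped, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(term) for term in ordered))
        # A single pass links every occurrence without touching inserted anchors.
        body = pattern.sub(
            lambda m: '<a href="%s">%s</a>' % (escaped[m.group(0)], m.group(0)), body)
    items = "".join('<li><a href="%s">%s</a></li>' % (html.escape(u), html.escape(u))
                    for u in relevant)
    return ("<html><body><p>%s</p><p>Relevant links:</p><ul>%s</ul></body></html>"
            % (body, items))
```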

6 Conclusions 

In this work we devised a novel method for automatic hypertext construction. To 
construct hypertexts from flat texts we first developed a text mining approach 
which adopted the self-organizing map algorithm to cluster these flat texts and 
generate two feature maps. The construction of hypertexts was achieved by
analyzing these maps. Two types of hyperlinks, namely the intra-cluster hyperlinks

1 http://www.cna.com.tw

2 Interested readers may access these hypertext documents at the author’s web site
http://www.im.cju.edu.tw/~hcyang/ahc.

Fig. 1. An example flat text (left) and its hypertext form (right)



and the inter-cluster hyperlinks, were created. The intra-cluster hyperlinks
created connections between a document and its relevant documents, while the
inter-cluster hyperlinks connected a document to documents outside its cluster
whose themes are keywords occurring in the source document. Experiments showed
not only that the text mining approach successfully revealed the co-occurrence
patterns of the underlying texts, but also that the devised hypertext construction
process effectively constructed semantic hyperlinks among these texts.

Abstract. Human beings generally analyze information with some kind of
semantic expectation. This not only speeds up processing, it also helps
to put the analysis in the correct context and perspective. To capitalize on this
type of intelligent human behavior, this paper proposes a semantic expectation-
based knowledge extraction methodology (SEKE) for extracting causation
relations from text. In particular, we study the application of a causation semantic
template to the Hong Kong stock market movement (Hang Seng Index) with
English financial news from Reuters, South China Morning Post and Hong Kong
Standard. With one month of data as input and a two-month testing period, the
system shows that it can correctly analyze single reason sentences with about
76% precision and 74% recall rates. If partial reason extraction (one out of two
reasons) is included and weighted by a factor of 0.5, the performance is improved
to about 83% and 81% respectively. As the proposed framework is language
independent, we expect cross-lingual knowledge extraction can work better with
this semantic expectation-based framework.

Keywords: knowledge extraction, semantic-based natural language processing, 
expectation-based information extraction. 



1 Introduction 

With rapid advances of technologies and the availability of vast information in the 
World Wide Web, information is easily accessible. This overwhelming load of 
information, especially in the form of articles (text), calls for good solution(s) in 
knowledge extraction technology to help understand the contents. The success of this 
important area will help us handle textual information more efficiently, such as 
indexing and relevant article searching, and more effectively, such as direct 
knowledge understanding and gathering. 

In the information rich financial sector, active research has been carried out 
with the information needs of banks and financial companies [1]. Typical examples 
include information extraction projects about take over activities from financial news 
(e.g. [2], [3]). In a broader view, the movements of major financial markets around 
the world such as New York, Tokyo, Europe, Hong Kong, etc., have closer links with,
and are of greater interest to, the general public. Many comments and analytical
articles are readily available in electronic newspapers as well as from information
providers. The articles usually analyze movements of a particular market, typically in some
indices (e.g. Dow Jones), with the reason(s) affecting it. These reasons can range from
the recent movements of other influencing financial markets, the state of the current
world economy and its outlook, and the possibility of outbreak of war in certain regions, to
some micro factors such as policies of local government.

Humans can extract the knowledge rather easily, but due to the rapid
development of the financial situation, very few ordinary people can keep themselves
up-to-date with the large volume of news articles available. One alternative solution is to
develop a knowledge extraction system whose main purpose is to analyze the news
articles and provide us with the summaries.

Full natural language understanding and processing has long been recognized as 
an important topic. However, we are still far away from a truly versatile and general 
system. By looking at a narrower domain, some researchers have reported successes.
Most information extraction systems to date, such as the current NYU Proteus
system and the SRI FASTUS system ^], use syntactic parsing as the main 
technology, while some use syntactic parsing with the aid of semantic analysis for 
solving their information extraction problems. 

In this paper, we explore a semantic expectation-based approach for the 
extraction of financial movement knowledge from news articles. Hong Kong stock 
market news is readily available to our research team over the internet and it is chosen 
as the topic of our study. In addition, we have on-line access to Reuters financial 
information service including financial news. It therefore becomes a part of the 
information sources. 

We first describe the semantic expectation-based knowledge extraction 
methodology (SEKE) for extracting market movements and the associated reasons in 
the following section. In section 3, we outline the studies carried out on Hong Kong 
Stock market and the findings are discussed in section 4. 



2 Semantic Expectation-Based Knowledge Extraction (SEKE) for 
Causation Knowledge 

The main purpose of an article is to convey information. In news articles about 
financial markets, the context of the information is more focused. Readers of these 
articles usually have certain semantic expectations in mind. For example, some may
be interested in knowing the latest market movements, some may want to read about 
the analysis of cross-market influences, and some may want to know what causes the 
recent market to move up or down. 

To design an effective knowledge extraction system, we can learn from the 
relevant human behavior (as those developed in the field of Artificial Intelligence). 
Two main characteristics are observed:

1. There is always an expected semantic in mind; and

2. The expected semantic is used to guide the search and understanding. 

Based on these characteristics, we can design the semantic expectation-based 
knowledge extraction system (SEKE) for causation financial market movement 
knowledge analysis with the following assumptions:

(a) There are some semantic concepts associated with market movements. In the
most general cases, there are three types of movements: upward, downward
and no movement.

(b) There are analyses about recent movements in which the influencing factor(s) are
discussed. The reason(s) and consequence (market movement) are presented
together, most likely in the same sentence, or close to each other.

(c) Although there are many different ways to present the information (sentence
styles), the same semantics are usually preserved. Semantic templates can be
used to extract the encapsulated knowledge based on the expected semantics,
and different sentence styles can be associated with the corresponding
templates.

(d) The reasons are usually restricted to those associated with the particular market
and its movements from recent information sources. It is possible to generate a
set of expected reasons at the beginning of the knowledge extraction process
based on recent information. The same is applicable to the expected
consequences.

A causation relation typically has two entities, reason(s) and consequence, linked 
by a directional causation indicator. It is expected that market movement analysis also 
has these two main concepts and its semantic template can be illustrated using the 
following figure: 




Fig. 1. The causation semantic template 

This semantic template states that one or more reasons cause the occurrence of
the consequence. As the semantic template is language independent but information is
expressed in a particular language, we have to associate with it the sentence styles
pertaining to that language (there is usually more than one sentence style to
express a causality in a given natural language). These sentence styles are called the
sentence templates.

For a given market, SEKE requires a preliminary study on the expected 
semantics for the knowledge to be extracted. A set of example/historical texts from 
the same (or similar) sources will be useful for this purpose. From this set of training
data, three groups of information are gathered: the sentence templates with possibly 
different styles matching the semantic template, the set of concepts matching the 
Reasons and the set of concepts matching the Consequence. Some examples of the 
sentence template, consequences and reasons are given below. 

Sentence template: 

The movement of stock is caused by a factor with its movement 
“Hang Seng Index rose as Wall Street gains” (an example sentence) 

where “Hang Seng Index rose” is the consequence, “Wall Street gains” is the
reason, and “as” is the English language causation expression which links the
reason to the consequence in the reverse order compared to the causation
semantic template in Figure 1. This example is also a simplified version of Figure
1 since it only has one reason.






Semantic Expectation-Based Causation Knowledge Extraction 1 17 



Reasons: Wall Street gains, US interest rate down, Nasdaq sinks. 

Consequences: 

Hang Seng Index rise, Hong Kong Stocks gains (upward movement)

Hang Seng Index drops, Hong Kong Stocks sink (downward movement) 

Hang Seng Index unchanged, Hong Kong Stocks barely changed (no movement) 

Since most training data sets cannot cover the full spectrum of the domain of
interest, SEKE introduces a way to conglomerate terms expressing similar concepts in
future encounters. It makes use of an electronic thesaurus (e.g. WordNet [6], Roget’s
Thesaurus [7]) to group unseen terms of the same category into the pre-defined
concepts. The following figure shows an example of how an unseen word “rose” is
absorbed into the upward movement concept using WordNet as the thesaurus in
SEKE.



[Figure: flowchart. Boxes: Incoming Word “rose”; Matching incoming word with
Movement Concept Term; Movement Concept Term table (Keyword: Category) with
Go up: upward, Fall: downward, Unchanged: no movement.]

Fig. 2. Movement Extraction through WordNet with an example movement concept term

Due to the limited coverage of any given thesaurus, there is no guarantee that a
new term can be found in the thesaurus or that it is correctly classified in the context
concerned. Thus, it is impossible to fully automate the concept grouping. For
example, “Dow Jones” and “Wall Street” denote the same entity to us but they may
not be captured in the thesaurus used. Human identification of similar concepts has to
be carried out from time to time to resolve the newly encountered terms.
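
The absorption of an unseen movement word can be sketched as a lookup followed by a fallback to human review. The synonym table below is a hypothetical stand-in for a real thesaurus such as WordNet, and the names `MOVEMENT_CONCEPTS`, `TOY_SYNSETS` and `classify_movement` are illustrative only.

```python
# Movement concept terms (keyword -> category), as in the example of Fig. 2.
MOVEMENT_CONCEPTS = {"go up": "upward", "fall": "downward", "unchanged": "no movement"}

# Hypothetical synonym sets standing in for a thesaurus lookup.
TOY_SYNSETS = {
    "rose": {"rise", "go up", "climb"},
    "slid": {"slide", "fall", "drop"},
}

def classify_movement(word, concepts, synsets=TOY_SYNSETS):
    """Return (category, needs_human) and absorb the word when a synonym matches."""
    word = word.lower()
    if word in concepts:
        return concepts[word], False
    for synonym in synsets.get(word, ()):
        if synonym in concepts:
            concepts[word] = concepts[synonym]   # remember the newly absorbed term
            return concepts[word], False
    return None, True                            # left for human identification
```

A term such as “Dow Jones”, which the toy table does not cover, comes back flagged for human identification, mirroring the limitation described above.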

The procedure for the semantic expectation-based knowledge extraction (SEKE) 
system for causation knowledge is outlined below: 

(1) Determine the domain area and scope of the study. 

(2) Define the expected semantic template. If there is more than one thing to be
extracted, multiple semantic templates have to be defined.

(3) Collect a training data set.

(4) From the data, generate the reason and consequence sets, and the set of 
sentences matching the template. 

(5) Analyze and match the concepts and template to the sentences. Collect this set 
of styles with their correspondences to the semantic templates. 

(6) When a new article is fed to the system, use the semantic template with the
collected sentence, reason and consequence sets to filter out unrelated
information and zoom in to the possibly relevant sentence(s) that can be fully
or partially matched.

(7) If the sentence can match the expected semantic template, collect it.

(8) Otherwise, use a thesaurus to check whether the unmatched terms have similar
meanings to our collected concepts. If all terms are matched with the semantic
template, then collect it.

(9) For those terms that cannot be matched by the thesaurus, human inspection is 
required. 

(10) It is possible that a sentence matches the concept terms but fails on the sentence
level. This may mean a possible unseen sentence style and human inspection
has to be carried out to decide if it is the required knowledge. Collect it into the
sentence concept knowledge base for future use if it expresses the intended
knowledge.
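
Steps (7) to (10) amount to a small decision procedure. The sketch below shows one way to route a candidate sentence; it assumes that concept terms have already been spotted in the sentence and that a flag says whether a known sentence template matched. The function name, the argument layout and the status strings are hypothetical.

```python
def route_sentence(terms, style_known, concepts, thesaurus):
    """Decide what to do with a candidate sentence.

    terms:       concept terms found in the sentence (reason factors, movements)
    style_known: True if the sentence matches a collected sentence template
    concepts:    set of already collected concept terms (updated in place)
    thesaurus:   dict term -> set of synonyms (hypothetical stand-in)
    """
    unresolved = []
    for term in terms:
        if term in concepts:
            continue
        if thesaurus.get(term, set()) & concepts:   # step (8): thesaurus match
            concepts.add(term)
        else:
            unresolved.append(term)
    if unresolved:                                  # step (9)
        return "human inspection: unknown terms", unresolved
    if not style_known:                             # step (10)
        return "human inspection: unseen sentence style", []
    return "collect", []                            # steps (7) and (8)
```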



3 A Study on Hong Kong Market Movements and the Influencing 
Reasons 

A study to extract the reasons affecting the movements of the Hang Seng Index (HSI) in
the Hong Kong stock market based on the SEKE framework was carried out. English news
articles for the study were solicited from the most reliable and relevant sources
available in Hong Kong, namely Reuters news, and South China Morning Post and Hong
Kong Standard, the two local English newspapers.

Relevant news articles in December 1999 were collected as the training data.
From this set of text, we analyze the single sentence knowledge expressing Hong
Kong Stock movements with their influencing reasons. Our observations are
summarized below:

1) Both the consequences and reasons have some movements. The causality relation 
in this study is about how the movements of some factors affect the Hong Kong 
market’s movements (measured by HSI). Therefore, the concept of movement is 
common to both the reasons and the consequences. The corresponding semantic 
template becomes: 

Some sample sentences are:

“The Hang Seng Index (HSI) rebounded as the interest rate rise.”

“Hong Kong's Hang Seng Index led the charge downwards, tumbling 7.18
per cent to close at 15,846.72 points due to overseas market drop.”

“Hong Kong stocks opened moderately higher on Thursday led by
technology-related companies after the U.S. Federal Reserve increased
interest rates by a quarter point as predicted.”

[Figure: boxes labelled Reasons and Consequence]

Fig. 3. The causation semantic template for the Hong Kong market movements

2) The Hang Seng Index (HSI) movements can be divided into three categories and
each is assigned a symbol. The categories are upward (+), downward (-) and no
movement (0). The concept seeding words are {up, rise, soar, gain, boost, high,
rebound, strong, recover}, {down, fall, sink, lose, tumble, low, weak, sell-off,
retreat} and {no change, consolidate, barely changed, easing} respectively.

3) The most common factors in the reasons are: 

Wall Street/Dow Jones, Nasdaq, US interest rate, overseas market.

Internal factors, e.g. technical stocks.

They are collected from the training data set as the fundamental concepts for 
the reasons in the causation semantic template. They will have to be 
combined with the three categories of movements to form the complete 
semantic descriptions of a reason. For example: interest rate fall, overseas 
market rises. 

4) The set of causation expressions for linking consequence to reasons is {as, “,”} and
the set of conjunction terms amongst the factors in the reasons is {and, “,”, but}.
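
Observations 2) to 4) can be combined into a toy parser for sentences of the form “consequence as reason and reason”. The seed words are those listed above; everything else is an assumption made for illustration: the small inflection table, the function names, and the choice to handle only the causation expression “as”. The actual system is implemented in Java and is not reproduced here.

```python
import re

MOVEMENT_SEEDS = {
    "+": {"up", "rise", "soar", "gain", "boost", "high", "rebound", "strong", "recover"},
    "-": {"down", "fall", "sink", "lose", "tumble", "low", "weak", "sell-off", "retreat"},
    "0": {"no change", "consolidate", "barely changed", "easing"},
}

# Inflected forms are not listed in the paper; this small table is an assumption.
INFLECTIONS = {"rose": "rise", "rises": "rise", "gains": "gain", "rebounded": "rebound",
               "fell": "fall", "falls": "fall", "sinks": "sink"}

def movement_of(phrase):
    """Return '+', '-', '0' or None according to the seed words found in phrase."""
    text = phrase.lower()
    words = [INFLECTIONS.get(w, w) for w in re.findall(r"[a-z\-]+", text)]
    for symbol, seeds in MOVEMENT_SEEDS.items():
        # Single-word seeds are matched as whole words, multi-word seeds as substrings.
        if any(s in words or (" " in s and s in text) for s in seeds):
            return symbol
    return None

def parse_causation(sentence):
    """Fill the causation template from a 'consequence as reason(s)' sentence."""
    consequence, sep, reasons = sentence.rstrip(". ").partition(" as ")
    if not sep:
        return None
    parts = re.split(r"\s+(?:and|but)\s+|\s*,\s*", reasons)
    return {"consequence": (consequence.strip(), movement_of(consequence)),
            "reasons": [(p.strip(), movement_of(p)) for p in parts if p.strip()]}
```

Each reason is returned as a (factor phrase, movement symbol) pair, which corresponds to combining a reason factor with one of the three movement categories as described in observation 3).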

Based on the SEKE framework described in the previous section, a system for
this study is implemented in Java and it is illustrated in Figure 4 below. In this SEKE
system, we highlight the human intervention steps with dotted lines. If these human
assisted steps are included, we can achieve almost 100% accuracy in the analysis;
they are therefore excluded from the evaluation of the experiment. Those news
articles that cannot be processed automatically by the SEKE system (represented by
solid lines in Figure 4) will be treated as incorrectly extracted (fail).

The SEKE system for Hong Kong market movement analysis takes in raw news 
articles either from the Reuters news data feed, or electronic news articles from the 
web sites of the other two newspapers: South China Morning Post and Hong Kong 
Standard. Preprocessing is carried out to filter the irrelevant information and zoom in 
to the relevant sentences. This is achieved with the expected sentence templates and 
the semantics of the set of possible consequences generated from the initial training 
data set (e.g. HSI rose). Those sentences identified as relevant will be semantically 
parsed (concept term matching without syntactical knowledge) using the domain 
knowledge captured in the reason factors concept, consequence (HSI) concept and 
movement concept. Successful sentences will be collected in the form of the causation 
semantic template (Figure 3). 

Some sentences may not be parsed semantically. Since irrelevant sentences have 
been filtered earlier, there are only two possibilities here: either some movement 
terms, or, reason factors cannot be recognized at this stage. Each unseen movement 
concept term is passed to the module on movement extraction using WordNet (double 
outlined box in Figures 4 and 2). If the unseen term can be classified, then it is
added to the movement concept knowledge and the parsing of the original sentence
continues. If the term cannot be recognized by the module then it will be kept for
human recognition and the sentence is considered failed in this study (marked as F2
in Figure 4). For unseen reason factors, since they are compound and specialized
concepts (e.g. “World Bank support”, “European Community’s intervention”) and
most of them are not captured by the current thesaurus, they have to be decided by
human recognition. Hence sentences containing this type of factor are considered
failed in this study (F1 in Figure 4).



[Figure: flowchart; top box labelled News Article]

Fig. 4. The SEKE system for Hong Kong Stock Movement Analysis

News articles of the three information sources from January to February 2000 are
used as the testing data set in the experiment. These news articles were classified
manually and they serve as the correct answers. A total of 81 relevant news sentences
(only single sentence information is studied) were collected (a 100% system success
rate compared to human classifications). Among them, 30 have multiple reasons
associated with a Hong Kong market movement while the other 51 have only a single
reason. The results are given below and the precision and recall rates are
summarized in Table 1.

Single-reason knowledge 

Total number of relevant sentences: 51 

Correctly extracted by SEKE: 43 

Wrongly extracted by SEKE: 6 (complicated sentences and unseen reason factors) 

Fail/No output: 2 

Multi-reason knowledge 

Total number of relevant sentences: 30 

Correctly extracted by SEKE: 17 

Partially extracted by SEKE: 11 (only a single reason out of two was extracted;
complicated phrases and unseen reason factors)

Wrongly extracted by SEKE: 2 (complicated sentences and unseen reason factors) 

Fail/No output: 0 
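
The counts above translate into precision and recall as follows. The short script is only a worked check of the arithmetic; the helper name `precision_recall` and its partial-credit parameters are illustrative, with the 0.5 weight taken from the text.

```python
def precision_recall(correct, extracted, relevant, partial=0, partial_weight=0.0):
    """Precision and recall, optionally giving partial extractions a fractional credit."""
    score = correct + partial_weight * partial
    return score / extracted, score / relevant

# Extracted = correct + partial + wrong; sentences with no output are not counted.
single = precision_recall(correct=43, extracted=43 + 6, relevant=51)
multi = precision_recall(correct=17, extracted=17 + 11 + 2, relevant=30)
multi_half = precision_recall(17, 30, 30, partial=11, partial_weight=0.5)
combined = precision_recall(43 + 17, 49 + 30, 51 + 30)
combined_half = precision_recall(60, 79, 81, partial=11, partial_weight=0.5)

for name, (p, r) in [("single-reason", single), ("multi-reason", multi),
                     ("multi-reason, half credit", multi_half),
                     ("combined", combined),
                     ("combined, half credit", combined_half)]:
    print("%-26s precision %.1f%%  recall %.1f%%" % (name, 100 * p, 100 * r))
```

The combined half-credit figures come out as 65.5/79 and 65.5/81, that is about 82.9% and 80.9%.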



Table 1. Summary of Experimental Results

Experiment results                                    Precision                Recall

Single-reason knowledge                               87.8% (43/49)            84.3% (43/51)

Correctly extracted multi-reason knowledge            56.7% (17/30)            56.7% (17/30)

Partially extracted multi-reason knowledge            36.6% (11/30)            36.6% (11/30)
(one reason extraction)

Multi-reason knowledge (including one reason          75% (17/30+0.5*11/30)    75% (17/30+0.5*11/30)
extraction as half of the rates)

Combined result (only consider fully extracted        76% (60/79)              74% (60/81)
knowledge)

Combined result (including single reason extraction   82.9%                    80.9%
in multi-reason sentences as contributing to half
of the rates)



1 Precision refers to the reliability of the information extracted. Recall refers to how
much of the information that should have been extracted was correctly extracted.
Applied to the system operations, they are calculated as follows.

$$\text{Precision} = \frac{\text{number of correctly extracted knowledge}}{\text{total number of knowledge extracted}}$$

$$\text{Recall} = \frac{\text{number of correctly extracted knowledge}}{\text{total number of knowledge}}$$

4 Discussion and Future Work

This paper proposes a semantic expectation-based knowledge extraction system,
SEKE, for causation knowledge, and a study using SEKE for extracting Hong Kong
stock market movement knowledge was carried out. By excluding human
recognition in the SEKE framework, the experiment only measures the performance
of the fully automated SEKE system.

For single-reason sentences, 88% precision and 84% recall rates were achieved.
For multi-reason sentences (all test sentences have only two reasons), full
knowledge extraction with all reasons and movements discovered is about
57% for both precision and recall; sentences with half of the reasons
discovered constitute another 37% for both performance measurements. None
of the multi-reason sentences had no knowledge extracted. The overall
precision and recall are 76% and 74% if we only consider full discovery of
knowledge, and if we include partially discovered knowledge with a 0.5 factor
(since one out of two reasons is extracted) then the rates improve to about 83%
and 81% respectively. Errors contributing to wrong knowledge extraction from
single-reason sentences are mainly due to the following three problems:

1. The movement of the reason cannot be extracted, which causes an incomplete
reason. For example:

“Hong Kong stocks were lower in lackluster afternoon trade on Thursday 
with investors seeking out Cable & Wireless HKT <0008. HK> (C&W 
HKT) and some other technology stocks but shunning much of the rest of 
the market.” 

2. The movements for the reason are wrongly extracted due to complex sentence 
expression. Example: 

“Hong Kong stocks are expected to ease further on Thursday under steady
pressure from interest rate concerns.”

In this example, “steady” was wrongly recognized as the movement for the
reason factor, where the correct one should be the increase of interest rate.

3. Unseen reason factor and/or movement. Example: 

“Hong Kong stocks are expected to mirror Wall Street overnight action on 
Tuesday, with blue chips easing on interest rate fears but with technology 
stocks defying the trend. ” 

Here “overnight action ” was unrecognizable to the SEKE system. 

For multi-reason knowledge, there are a number of cases in which the
SEKE system can only extract one out of two reasons. Although problems
encountered by the single-reason sentence extraction may occur, our analysis reveals
that the main problem is due to unseen reason factors. Some examples are shown
below:

“Hong Kong stocks finished strongly higher on Wednesday morning 
following a rebound on Wall Street and after a Hutchison Whampoa Ltd 
<0013. HK> Internet deal inspired fresh interest in technology stocks. ” 

“Hong Kong stocks finished Friday morning flat as a combination of 
interest rate concerns, covered warrants and fund outflows pared early 
gains. ” 

The third problem listed under single-reason sentences can be addressed by human
recognition of the unseen concepts, but when the sentence structure becomes too
complex and there is no way to semantically match the expected concepts, the current
SEKE system fails to process the knowledge extraction task. Two such cases were
discovered for the single-reason sentences.

By using the expected semantic knowledge to extract information from financial
news articles, similar to how human beings understand texts, we have shown that the
approach is workable. It is simple and computationally less demanding than the
syntactical, statistical and/or probabilistic approaches. The unresolved problem of
complex sentences may not be easily tackled by these approaches either. In our
context, we believe that more complex semantic templates have to be defined and
captured for this purpose, and we will look into this possible direction.

Because the SEKE system is an evolving system capable of incrementally
improving itself with more input examples, we will have to try out different sequences
of tests to check its average extraction capability. In terms of the granularity of the
degree of modifier for the movement knowledge, we would also like to explore more
detailed information. Distinctions such as “HSI rises”, “HSI rises sharply”, and “HSI
rises moderately” will be incorporated. In addition, the causation semantic template
generated for the analysis of Hong Kong stocks is believed to be equally
applicable to other market movement analyses, provided the reason and consequence
concepts are collected from the relevant training articles. Since the semantic template
is language independent, the same process can be ported to another language.
This also sets our future direction.

Abstract. This paper describes a flexible and efficient toolbox based
on the scripting language Python, capable of handling common tasks in
data mining. Using either a relational database or flat files, the toolbox
gives the user a uniform view of a data collection. Two core features
of the toolbox are caching of database queries and parallelism within
a collection of independent queries. Our toolbox provides a number of
routines for basic data mining tasks on top of which the user can add
more functions - mainly domain and data collection dependent - for
complex and time-consuming data mining tasks.

Keywords: Python, Relational Database, SQL, Caching, Health Data 



1 Introduction 

Due to the availability of cheap disk space and automatic data collection
mechanisms, huge amounts of data in the terabyte range are becoming common in
business and science [7]. Examples include the customer databases of health
and car insurance companies, financial and business transactions, chemistry and
bioinformatics databases, and remote sensing data sets. Besides being used to
assist in daily transactions, such data may also contain a wealth of information
which traditionally has been gathered independently at great expense. The aim
of data mining is to extract useful information out of such large data
collections [3].

There is much ongoing research in sophisticated algorithms for data min- 
ing purposes. Examples include predictive modelling, genetic algorithms, neural 
networks, decision trees, association rules, and many more. However, it is gen- 
erally accepted that it is not possible to apply such algorithms without careful 
data understanding and preparation, which may often dominate the actual data 
mining activity m. It is also rarely feasible to use off-the-shelf data mining 
software and expect useful results without a substantial amount of data insight. 
In addition, data miners working as consultants are often presented with data
sets from an unfamiliar domain and need to get a good feel for the data and the
domain prior to any “real” data mining. The ease of initial data exploration and
preprocessing may well hold the key to successful data mining results later in a
project.

* Corresponding author, E-Mail: Ole.Nielsen@anu.edu.au

D. Cheung, G.J. Williams, and Q. Li (Eds.): PAKDD 2001, LNAI 2035, pp. 124-^^^ 2001.

Using a portable, flexible, and easy to use toolbox can not only facilitate 
the data exploration phase of a data mining project, it can also help to unify 
data access through a middleware library and integrate different data mining 
applications through a common interface. Thus it forms the framework for the 
application of a suite of more sophisticated data mining algorithms. This paper 
describes the design, implementation, and application of such a toolbox in real- 
life data mining consultancies. 



1.1 Requirements of a Data Mining Toolbox 

It has been suggested that the size of databases in an average company
doubles every 18 months [2], which is akin to the growth of hardware performance
according to Moore’s law. Yet, results from a data mining process should be
readily available if one wants to use them to steer a business in a timely fashion.
Consequently, data mining software has to be able to handle large amounts of
data efficiently and fast.

On the other hand, data mining is as much an art as a science, and real-life
data mining activities involve a great deal of experimentation and exploration
of the data. Often one wants to “let the data speak for itself”. In these cases
one needs to conduct experiments where each outcome leads to new ideas and
questions which in turn require more experiments. Therefore, it is mandatory
that data mining software facilitates easy querying of the data.

Furthermore, data comes in many disguises and different formats. Examples 
are databases, variants of text files, compact but possibly non-portable binary 
formats, computed results, data downloaded from the Web and so forth. Data 
will usually change over time - both with respect to content and representation 
- as will the demands of the data miner. It is desirable to be able to access and 
combine all these variants uniformly. Data mining software should therefore be 
as flexible as possible. 

Finally, data mining is often carried out by a group of collaborating re- 
searchers working on different aspects of the same dataset. A suitable software 
library providing shared facilities for access and execution of common opera- 
tions leads to safer, more robust and more efficient code because the modules 
are tested first by the developer and then later by the group. A shared toolbox 
also tends to evolve towards efficiency because the best ideas and most useful 
routines will be chosen among all tools developed by the group. 



This paper describes such a toolbox - called DMtools - developed by and aimed
at a small data mining research group for fast, easy, and flexible access to large
amounts of data.

The toolbox is currently under development and a predecessor has been
successfully applied in health data mining projects under the ACSys CRC^. It
assists our research group in all stages of data mining projects, starting from
data preprocessing, analysis and simple summaries up to visualisation and
report generation.



2 Related Work 

Database and data mining research are two overlapping fields and there are many 
publications dealing with their intersection. An overview of database mining is 
given in |0|. According to the authors the efficient handling of data stored in 
relational databases is crucial because most available data is in a relational form. 
Scalable and efficient algorithms are one of the challenges, as is the development 
of high-level query languages and user interfaces. Another key requirement is 
interactivity. 

A classification of frameworks for integrating data mining applications and 
database systems is presented in Three classes are presented: (1) Conven- 
tional - also called loosely coupled - where there is no integration between the 
database system and the data mining applications. Data is read tuple by tuple 
from a database, which is very time consuming. The advantage of this method is 
that any application previously running on data stored in a file system can eas- 
ily be changed, but the disadvantage is that no database facilities like optimised 
data access or parallelism are used. (2) In the tightly coupled class data in- 
tensive and time-consuming operations are mapped to appropriate SQL queries 
and executed by the database management system. All applications that use 
SQL extensions or propose such extensions to improve data mining algorithms 
are within this class. (3) In the black box approach complete data mining algo- 
rithms are integrated into the database system. The main disadvantage of such 
an approach is its lack of flexibility. Following this classification, our DMtools 
belong to the tightly-coupled approach, as we generate simple SQL queries and 
retrieve the results for further processing in the toolbox. As the results are often 
aggregated data or statistical summaries, communication between the database 
and data mining contexts can be reduced significantly. 
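
The difference between the loosely coupled and the tightly coupled class can be illustrated with a small sketch. It is not taken from DMtools: Python's built-in sqlite3 module stands in for the database server, and the table and column names (prescriptions, patient, cost) are invented for illustration. In the first function every tuple crosses the database interface; in the second only the aggregated summary does.

```python
import sqlite3

# Hypothetical example data: (patient id, cost of one prescription).
rows = [("p1", 10.0), ("p1", 5.0), ("p2", 7.5), ("p3", 2.5), ("p3", 2.5)]

conn = sqlite3.connect(":memory:")   # sqlite3 is only a stand-in for MySQL
conn.execute("CREATE TABLE prescriptions (patient TEXT, cost REAL)")
conn.executemany("INSERT INTO prescriptions VALUES (?, ?)", rows)


def total_cost_loosely_coupled(conn):
    """Read every tuple and aggregate inside the application."""
    totals = {}
    for patient, cost in conn.execute("SELECT patient, cost FROM prescriptions"):
        totals[patient] = totals.get(patient, 0.0) + cost
    return totals


def total_cost_tightly_coupled(conn):
    """Let the database aggregate; only the summary is transferred."""
    query = "SELECT patient, SUM(cost) FROM prescriptions GROUP BY patient"
    return dict(conn.execute(query))


loose = total_cost_loosely_coupled(conn)
tight = total_cost_tightly_coupled(conn)
```

Both calls return the same dictionary, but the second one moves one row per patient instead of one row per prescription.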

Several research papers address data mining based on SQL databases and
propose extensions to the SQL standard to simplify data mining and make it
more efficient. In ^ the authors propose a new SQL operator that enables
efficient extraction of statistical information which is required for several
classification algorithms. The problem of mining general association rules and
sequential patterns with SQL queries is addressed in H2I, where it is shown that
it is possible to express complex mining computations using standard SQL. Our
data mining toolbox is currently based on relational databases, but can also
integrate flat files. No SQL extension is needed; instead we put a layer on top
of SQL where most of the “intelligent” data processing is done. Database queries
are cached to improve performance and re-usability.

^ ACSys CRC stands for ‘Advanced Computational Systems Collaborative Research
Centre’ and the data mining consultancies were conducted at the Australian National
University (ANU) in collaboration with the Commonwealth Scientific and Industrial
Research Organisation (CSIRO).

Other toolbox approaches to data analysis include the IDEA (Interactive
Data Exploration and Analysis) system [12], where the authors identify five
general user requirements for data exploration: querying (the selection of a subset
of data according to the values of one or more attributes), segmenting (splitting
the data into non-overlapping sub-sets), summary information (like counts or
averages), integration of external tools and applications, and history mechanisms.
The IDEA framework allows quick data analysis on a sampled sub-set of the
data with the possibility to re-run the same analysis later on the complete data
set. IDEA runs on a PC, with the user interacting on a graphical interface.

Yet another approach used in the Control jSj project is to trade quality and 
accuracy for interactive response times, in a way that the system quickly returns 
a rough approximation of a result that is refined continuously. The user can 
therefore get a glimpse at the final result very quickly and use this information 
to change the ongoing process. The Control system, among others, includes tools 
for interactive data aggregation, visualisation and data mining. 

An object-oriented framework for data mining is presented in [14]. The
described Data Miner’s Arcade provides a collection of APIs for data access,
plug’n’play type tool integration with graphical user interfaces, and for
communication of results. Access to analysis tools is provided without requiring the
user to become proficient with the different user interfaces. The framework is
implemented in Java.



3 Choice of Software 

The DMtools are based on the scripting language Python [J), an excellent tool for
rapid code development that meets all of the requirements listed in Section 1.1
very well. Python handles large amounts of data efficiently, it is very easy to write
scripts as well as general functions, it can be run interactively (it is interpreted),
and it is flexible with regard to data types because it is based on general lists
and dictionaries (associative arrays), of which the latter are implemented as very
efficient hash tables. Functions and routines can be used as templates which can
be changed and extended as needed by the user to do more customised analysis
tasks. With a new data exploration idea in mind, the data miner can implement
a rapid prototype very easily by writing a script using the functions provided by
our toolbox.

Databases using SQL are a standardised tool for storing and accessing
transactional data in a safe and well-defined manner. The DMtools access a
relational database using the Python database API^. Currently, we are using
MySQL ^ for the underlying database engine, but modules for other database
servers are available as well. Both MySQL and Python are freely available,
licensed as free software and enjoy very good support from a large user community.
In addition, both products are very efficient and robust.

^ Available from the Python homepage at http://www.python.org/topics/database/

Fig. 1. Architecture of DMtools

3.1 Toolbox Architecture 

In our toolbox the ease of SQL queries and the safety of relational databases
are combined with the efficiency of flat file access and the flexibility of
object-oriented programming languages in an architecture as shown in Figure 1.
Based on relational databases, flat files, the Web, or any other data source, a Data
Manager deals with retrieval, caching and storage of data. It provides routines
to execute an arbitrary SQL query and to read and write binary and text files.
The two important core components of this layer are its transparent caching
mechanism and its parallel database interface which intercepts SQL queries and
parallelises them on-the-fly. The Aggregation module implements a library of
Python routines taking care of simple data exploration, statistical computations,
and aggregation of raw data. The Modelling module contains functions
for parallel predictive modelling, clustering, and generation of association rules.
Finally, the Report module provides visualisation and facilities for simple
automatic report generation.

Functions defined in the toolbox layer are designed to deal with issues specific
to a given data mining project, which means they use knowledge about
a given database structure and return customised results and plots. This layer
contains routines that are not available in standard data analysis or data mining
packages.

Example 1. Dictionary of Mental Health Medications 

A central object within the domain of health statistics is a cohort, defined here
as a Python dictionary of entities like customers or patients fulfilling a given
criterion. As one task in a data mining project might be the analysis of a group
of entities (e.g. all patients taking certain medication), one can use the function
get_cohort to extract such a cohort once and cache the resulting dictionary
so it is readily available for subsequent invocations. Being interested in mental
health patients, one might define a dictionary like the one shown below and use
it to get a cohort.

mental_drugs = {'Depression': [32654, 54306, 12005, 33421],
                'Anxiety': [10249, 66241],
                'Schizophrenia': [99641, 96044, 39561]}

depressed = get_cohort(mental_drugs['Depression'], 1998)

Several kinds of analyses can be performed using a cohort as a starting point.
For example, the function plot_age_gender(depressed) provides barplots of the
given cohort with respect to age groups and gender, incorporating denominator
data if available. Another function, list_drug_usage(depressed), gives a list
of all medication prescribed to patients in the given cohort. This list includes
a description of each drug, the number of patients using it, the total number of
prescriptions and the total cost of each drug. Routines from all modules can either
be used interactively or added to other Python scripts to build more complex
analysis tasks.
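
To make the notion of a cohort as a dictionary more concrete, the following sketch builds one from a handful of in-memory records. It is only an illustration under stated assumptions: the real get_cohort queries the database and its internals are not described here, so the function name get_cohort_sketch, the record layout and the patient identifiers are all hypothetical.

```python
# Hypothetical prescription records: (patient id, drug code, year).
prescriptions = [
    ("A01", 32654, 1998),
    ("A01", 10249, 1998),
    ("B02", 54306, 1997),
    ("C03", 12005, 1998),
    ("C03", 12005, 1998),
    ("D04", 99641, 1998),
]


def get_cohort_sketch(records, drug_codes, year):
    """Return {patient id: list of matching drug codes} for one year."""
    wanted = set(drug_codes)          # set lookup is a hash-table operation
    cohort = {}
    for patient, drug, rec_year in records:
        if rec_year == year and drug in wanted:
            cohort.setdefault(patient, []).append(drug)
    return cohort


mental_drugs = {'Depression': [32654, 54306, 12005, 33421]}
depressed = get_cohort_sketch(prescriptions, mental_drugs['Depression'], 1998)
```

Here the cohort contains patients A01 and C03; B02 is excluded by the year and D04 by the drug code.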

4 Caching 

Caching of function results is a core technology used throughout DMtools in order 
to render the database approach feasible. We have developed a methodology for 
supervised caching of function results as opposed to the more common (and also 
very useful) automatic disk caching provided by most operating systems and 
Web browsers. 

Like automatic disk caching, supervised caching trades space for time, but 
the approach we use is one where time consuming operations such as database 
queries or complex functions are intercepted, evaluated and the resulting ob- 
jects are made persistent for rapid retrieval at a later stage. We have observed 
that many of these time consuming functions tend to be called repetitively with 
the same arguments. Thus, instead of computing them every time, the cached 
results are returned when available, leading to substantial time savings. The 
repetitiveness is even more pronounced when the toolbox cache is shared among 
many users, a feature we use extensively. This type of caching is particularly
useful for computationally intensive functions with few frequently used
combinations of input arguments. Supervised caching is invoked in the toolbox by
explicitly applying it to chosen functions. Given a Python function of the form
T = func(arg1, ..., argn), caching in its simplest form is invoked by replacing
the function call with T = cache(func, (arg1, ..., argn)).
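
The idea of supervised caching can be sketched in a few lines. The code below is a minimal illustration and not the DMtools implementation: it assumes results can be pickled, uses a temporary directory as the cache location, and derives the cache file name from a hash of the function name and its arguments. None of these choices is claimed to match the toolbox.

```python
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp(prefix="cache_sketch_")  # assumed cache location


def cache(func, args=()):
    """Return func(*args), reusing a pickled result from disk if present."""
    if not isinstance(args, tuple):
        args = (args,)                # tolerate a single bare argument
    key_source = pickle.dumps((func.__module__, func.__name__, args))
    key = hashlib.sha256(key_source).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:   # cache hit: load the stored object
            return pickle.load(f)
    result = func(*args)              # cache miss: evaluate and store
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def slow_square(x):
    calls.append(x)                   # records how often the function really runs
    return x * x


first = cache(slow_square, (12,))     # computed and written to disk
second = cache(slow_square, (12,))    # read back from the cache file
```

The second call returns the stored result without executing slow_square again, which is the trade of disk space for time described above.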

Example 2. Function Caching 

Caching of a simple SQL query using the toolbox function execquery can be 
done as follows: 

database = 'CustomerData'
query = 'select distinct CustomerID, count(*) from %s;' % database
customer_list = cache(execquery, (query,))

Table 1. Function Caching Statistics

                                     Time (sec)
Function Name                Hits    Exec   Cache   Gain (%)   Size (MB)
execquery                   4,149     130       6      91.43        4.53
get_mbs_patients              172   1,281      76      93.92       48.53
get_selected_transactions     420   1,507       5      99.33        6.67
multiquery                     46     133       0      99.69        0.76
simplequery                     5      50       0      99.86        0.08
get_cohort                    168     489       0      99.92        0.20
get_drug_usage                 95   1,388       0      99.99        0.02


The user can take advantage of this caching technique by applying it to 
arbitrary Python functions. However, this technique has already been employed 
extensively in the Data Manager module so using the high level toolbox routines 
will utilise caching completely transparently with no user intervention - the 
caching supervision has been done in the toolbox design. For example, most of 
the SQL queries that are automatically generated by the toolbox are cached 
in this fashion. Generating queries automatically increases the chance of cache 
hits as opposed to queries written by the end user because of their inherent 
uniformity. In addition to this, caching can be used in code development for 
quick retrieval of precomputed results. For example if a result is obtained by 
automatically crawling the Web and parsing HTML or XML pages, caching 
will help in retrieving the same information later - even if the Web server is 
unserviceable at that point. 

The function get_cohort used in a particular project required on average
489 seconds of CPU time on a Sun Enterprise server and the result took
up about 200 kilobytes of memory. Subsequent loading takes 0.22 seconds -
more than 2,000 times faster than the computing time. This particular function
was hit 168 times in a fortnight, saving four users a total of 23 hours of waiting.
Table 1 shows some caching statistics from a real-life data mining consultancy
in health services obtained from four users over two weeks. The table has one
entry for each Python function that was cached. The second column shows how
many times a particular instance of that function was hit, i.e. how many times
results were retrieved from the cache rather than being computed. The third
column shows the average CPU time which was required by instances of each
function when they were originally executed, and the fourth column shows the
average time it took to retrieve cached results for each function. The fifth column
then shows the average percentage gain ((Exec - Cache) / Exec * 100) achieved
by caching instances of each function, and the sixth column shows the average
size of the cached results for each function. The table is sorted by average gain.
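
The figures quoted for get_cohort can be checked with a short calculation that applies the gain formula to the stated execution and retrieval times. The variable names are ours. The gain computed this way from the two averages need not coincide exactly with the table entry, since the table reports the average of the per-instance gains.

```python
exec_time = 489.0     # seconds to compute get_cohort (from the text)
cache_time = 0.22     # seconds to load the cached result (from the text)
hits = 168            # cache hits in a fortnight (from the text)

gain = (exec_time - cache_time) / exec_time * 100      # percentage gain
speedup = exec_time / cache_time                       # retrieval vs. computation
hours_saved = hits * (exec_time - cache_time) / 3600   # total waiting time avoided

print(f"gain: {gain:.2f}%")
print(f"speedup: {speedup:.0f}x")
print(f"time saved: {hours_saved:.1f} hours")
```

The speedup comes out above 2,000 and the saved time at just under 23 hours, in line with the numbers stated in the text.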

If the function definition changes after a result has been cached, or if the
result depends on other files, wrong results may occur when using caching in its
simplest form. The caching utility therefore supports specification of explicit
dependencies in the form of a file list, which, if modified, triggers a recomputation.
Other options include forced recomputation of functions, statistics regarding
time savings, sharing of cached results, clean-up routines and data compression.
Note that if the inputs or outputs are very large, caching might not save time
because disk access may dominate the execution time. This is due to overheads
consisting mainly of input checks, hashing and comparisons of argument lists,
as well as writing and reading cache files. If caching does not lead to any time
savings, a warning is given. Very large datasets are dealt with through blocking
into manageable chunks and separate caching of these.
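
One possible way to realise file dependencies is to store, alongside each cached result, the modification time and size of every dependency and to recompute when any of them differs. The sketch below follows that assumption. The function cache_with_deps, the in-memory dictionary that replaces the cache files, and the helper read_config are all hypothetical and not part of the toolbox.

```python
import os
import pickle
import tempfile

_store = {}  # in-memory stand-in for cache files: key -> (stamps, result)


def cache_with_deps(func, args=(), dependencies=()):
    """Return a cached func(*args) unless a dependency file has changed."""
    if isinstance(dependencies, str):
        dependencies = [dependencies]
    # Modification time and size describe the current state of each file.
    stamps = [(p, os.path.getmtime(p), os.path.getsize(p)) for p in dependencies]
    key = pickle.dumps((func.__name__, args))
    if key in _store and _store[key][0] == stamps:
        return _store[key][1]
    result = func(*args)
    _store[key] = (stamps, result)
    return result


runs = []


def read_config(path):
    runs.append(path)                 # counts real evaluations
    with open(path) as f:
        return f.read()


fd, cfg = tempfile.mkstemp(suffix=".xml")
os.close(fd)
with open(cfg, "w") as f:
    f.write("<config/>")

a = cache_with_deps(read_config, (cfg,), dependencies=cfg)   # computed
b = cache_with_deps(read_config, (cfg,), dependencies=cfg)   # from cache
with open(cfg, "w") as f:
    f.write("<config><feature/></config>")                   # edit the file
c = cache_with_deps(read_config, (cfg,), dependencies=cfg)   # recomputed
os.remove(cfg)
```

Only the first and third calls execute read_config; the second is served from the cache because the dependency is unchanged.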

Example 3. Caching of XML Documents 

Supervised caching is used extensively in the toolbox for database querying but 
is by no means restricted to this. Caching has proven to be useful in other aspects 
of the data mining toolbox as well. An example is a Web application built on top 
of the toolbox which allows managers to explore and rank branches according 
to one or more user-defined features such as Annual revenue, Number of cus- 
tomers serviced relative to total population, or Average sales per customer. The 
underlying data is historical sales transaction data which is updated monthly, 
so features need only be computed once for new data when it is added to the 
database. Because the data is static, cached features are never recomputed and 
the application can therefore make heavy use of the cached database queries. 
Moreover, no matter how complicated a feature is, it can be retrieved as quickly 
as any other feature once it has been cached. In addition, the Web application 
is configured through an XML document defining the data model and describ- 
ing how to compute the features. The XML document must be read by the 
toolbox, parsed and converted into appropriate Python structures prior to any 
computations. Because response time is paramount in an interactive application,
repeatedly parsing and interpreting the XML is prohibitive, but by using the
caching module, the resulting Python structures are cached and retrieved quickly
enough for the interactive application. The caching function was made dependent
on the XML file itself, so that all structures are recomputed whenever the XML
file has been edited - for example to modify an existing feature definition, add a
new month, or change the data description. Below is a code snippet from the Web
application. The XML configuration file is assumed to reside in sales.xml. The
parser which builds Python structures from XML is called parse_config and
it takes the XML filename as input. To cache this function, instead of the call
(feature_list, branch_list) = parse_config(filename) we write:

filename = 'sales.xml'

(feature_list, branch_list) = cache(parse_config, (filename,),
                                    dependencies=filename)



5 Database Access and Parallelism 

The toolbox provides powerful and easy-to-use access to an SQL database using
the Python database API. We are using MySQL but any SQL database known
to Python will do. In its simplest form it allows execution of any valid SQL
query. If a list of queries is given, they are executed in parallel by the database
server if a multiprocessor architecture is available and the results are returned
in a list.
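
A list of independent queries can be dispatched concurrently from the client side, for instance with a thread pool. The sketch below shows this pattern only; it is an assumption about how such an interface might look, not the toolbox's own mechanism. The names multiquery_sketch and fake_run_query are hypothetical, and fake_run_query merely simulates a database call.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def multiquery_sketch(run_query, queries, max_workers=5):
    """Run independent queries concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_query, queries))


def fake_run_query(query):
    """Stand-in for a database call: waits briefly, then names the table."""
    time.sleep(0.05)
    return "rows from " + query.split()[-1]


queries = ["select * from t%d" % i for i in range(1, 6)]
results = multiquery_sketch(fake_run_query, queries)
```

Because pool.map preserves the order of its inputs, the i-th result always belongs to the i-th query, regardless of which query finishes first.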

The achievable speedup of this procedure will naturally depend on factors 
such as the amount of communication and the load balancing. For large results 
communication time will dominate the execution time thus reducing the speedup. 
In addition, the total execution time of a parallel query is limited by the slowest 
query in the list, so if the queries are very different in their complexity, speedup 
will only be modest. However, a well balanced parallel query where results are of 
a reasonable size can make very good use of a parallel architecture. For example, 
executing a parallel query over five tables of size varying from 250 thousand to 13 
million transactions took 2,280 seconds when run sequentially and 843 seconds 
when run in parallel. This translates into a speedup of 2.7 on five processors or
a parallel efficiency of 0.54.
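
The speedup and efficiency follow directly from the two timings: speedup is the sequential time divided by the parallel time, and efficiency is the speedup divided by the number of processors. A short check, with variable names of our choosing:

```python
sequential = 2280.0   # seconds, sequential run (from the text)
parallel = 843.0      # seconds, parallel run (from the text)
processors = 5

speedup = sequential / parallel
efficiency = speedup / processors

print(f"speedup {speedup:.1f}, efficiency {efficiency:.2f}")
```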

The database interface makes use of supervised caching technology and
caches the results of queries as described in the previous section. This can be
enabled or disabled through a keyword argument in the function execquery. The
data_manager module also contains a number of functions to perform standard
queries across several tables. One example is the function standardquery which
takes as input two attribute names, A1 and A2, a list of (conforming) database
tables, and an optional list of criteria to impose simple restrictions on the query.
The function returns all occurrences of attribute A2 for each distinct value of
A1 from all tables where all the given criteria are met (using conjunction). For
example, the call

standardquery('Company', 'Customer', tables,
              [('Year', 1997), ('qtr', 1)])

yields a count of customers for each company in the first quarter of 1997. Another
example is the function joinquery which improves the performance of normal
SQL joins. It takes as arguments a list of fields, a list of tables, a list of joins of
the form 'table1_name.field = table2_name.field' and a list of conditions,
and returns a dictionary of results.
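
A query helper of this kind can be sketched by generating one simple SELECT per table and merging the results in Python. The code below is a hypothetical reconstruction based only on the description above: the name standardquery_sketch, the explicit connection argument, the use of sqlite3 in place of MySQL, and the example tables are all assumptions.

```python
import sqlite3


def standardquery_sketch(conn, a1, a2, tables, criteria=()):
    """Collect the values of column a2 for each distinct value of a1.

    Column and table names are inserted directly into the SQL text, so
    they must come from trusted code; criteria values use placeholders.
    """
    where = " AND ".join("%s = ?" % name for name, _ in criteria)
    params = [value for _, value in criteria]
    result = {}
    for table in tables:
        sql = "SELECT %s, %s FROM %s" % (a1, a2, table)
        if where:
            sql += " WHERE " + where
        for key, value in conn.execute(sql, params):
            result.setdefault(key, []).append(value)
    return result


conn = sqlite3.connect(":memory:")
for table in ("sales_a", "sales_b"):
    conn.execute("CREATE TABLE %s (Company TEXT, Customer TEXT, "
                 "Year INT, qtr INT)" % table)
conn.executemany("INSERT INTO sales_a VALUES (?, ?, ?, ?)",
                 [("Acme", "c1", 1997, 1), ("Acme", "c2", 1997, 2),
                  ("Bolt", "c3", 1997, 1)])
conn.executemany("INSERT INTO sales_b VALUES (?, ?, ?, ?)",
                 [("Acme", "c4", 1997, 1), ("Bolt", "c5", 1996, 1)])

customers = standardquery_sketch(conn, "Company", "Customer",
                                 ["sales_a", "sales_b"],
                                 [("Year", 1997), ("qtr", 1)])
counts = {company: len(found) for company, found in customers.items()}
```

Because the generated SQL text depends only on the arguments, repeated calls produce identical queries, which is what makes automatically generated queries good candidates for caching.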



6 Applications 

To illustrate the application of the DMtools we give two examples designed and
used for a health services data mining consultancy. The data collection we had at
our disposal consisted of two tables, one containing medication prescriptions and
the other containing doctor consultancies by patients. In addition, we had
specialty information about doctors and geographical information that associated
each post code with one of seven larger area codes (like capital, metropolitan or
rural). All patient and doctor identifiers were coded to protect individual
privacy. Finally, we had data describing different drugs and different treatments
obtained from the Web.

Example 4. Doctors Prescription Behaviour

In this example, we describe how we analysed prescription patterns of specialist 
doctors. The aim was to find unusual behaviour such as over-prescribing. In 
particular, we wanted to know for each specialist how many patients he or she 
serviced and how many of these were prescribed a particular type of medication. 

For this task, we linked medication prescription information with patient
information for a user-given doctor specialty. The domain-specific function

get_doctor_behaviour(items, years, specialty_code)

takes as input a list of drug code items, a list of years (as we are interested in
the temporal changes in prescription behaviour over a time period) and a doctor
specialty code.

The toolbox is used as follows: First, the cohort of all patients taking
medications in the given items list is extracted from the medication prescription
database and the cohort of all patients seeing doctors with the particular
specialty code is extracted from the doctor consultancy database. These lists are
then matched, resulting in the desired table (see example below). Sorting this
table gives the doctors with the highest ratio of prescriptions per patient, which
can serve as a starting point for detecting over-prescribing.

The first run of get_doctor_behaviour with items being psychotropic drugs
and specialty_code being psychiatrists, over a five-year period, required a
run time of about two hours, extracting about 115 psychiatrists, almost 10,000
patients and more than half a million transactions to analyse. Subsequent
studies with different medication groups were each processed in less than a minute
thanks to caching.



Doctor Code |                  | 1995 | 1996 | 1997 | 1998
x42rl9$     | Total Patients:  |  424 |  450 |  241 |  199
            | Mental Patients: |  167 |  198 |  142 |
            | Ratio:           |  39% |  44% |  59% |  66%
7%w#t0q     | Total Patients:  |  372 |  336 |  335 |  389
            | Mental Patients: |  101 |  115 |  121 |  156
            | Ratio:           |  27% |  34% |  36% |  40%


Example 5. Episode Extraction 

Episodes are units of health care related to a particular type of treatment for a
particular problem. An episode may last anywhere from one day to several years.
Analysing temporal episodic data from a transactional database is a hard task
not only because there are very many episodes within a large database but also
because episodes are complex objects with different lengths and contents. To
facilitate better understanding and manipulation, the DMtools contain routines
to examine their basic characteristics, like length, number of transactions,
average cost, etc. One example is the timelines diagram which displays all medical
transactions for a single patient split up for different groups of items, as is shown
in Figure 2. Detecting and extracting episodes from a transactional database is
very time consuming and caching of such functions is very helpful. It is feasible
to cache several hundred thousand episodes - even if the resulting cache file has
a size of a hundred megabytes - because the access time to get all these episodes
is reduced from hours to minutes.

[Figure: two timeline panels, “Transactions for Patient TJPNC” and “Transactions
for Patient BTXOW”; horizontal axis: weeks in 1997 to 1998, from Jan-97 to Dec-98]

Fig. 2. Timelines: Medical services for two patients
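
The basic characteristics mentioned above (length, number of transactions, average cost) are straightforward to compute once an episode is available as a list of transactions. The sketch below assumes a simple representation of an episode as (service date, cost) pairs. The representation, the example values and the function episode_summary are hypothetical.

```python
from datetime import date

# Hypothetical episode: a list of (service date, cost) transactions.
episode = [
    (date(1997, 3, 3), 25.0),
    (date(1997, 3, 17), 40.0),
    (date(1997, 4, 14), 25.0),
]


def episode_summary(transactions):
    """Basic characteristics of one episode."""
    dates = [d for d, _ in transactions]
    costs = [c for _, c in transactions]
    return {
        # Both the first and the last day count towards the length.
        "length_days": (max(dates) - min(dates)).days + 1,
        "n_transactions": len(transactions),
        "average_cost": sum(costs) / len(costs),
    }


summary = episode_summary(episode)
```

Since such summaries are small compared with the raw transactions they are derived from, they are natural candidates for the supervised caching described in Section 4.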



7 Outlook and Future Work 

The DMtools is a project driven by the needs of a group of researchers who are
doing consultancies in health data mining. With this toolbox we try to improve
and facilitate routine tasks in data analysis, with an emphasis on the exploration
phase of a data mining project. It is important to have tools at hand that help
to analyse and get a “feel” for the data interactively in the early stages of a data
mining project, especially if the data is provided from external sources. This is in
contrast to many data mining and knowledge discovery algorithms that aim at
extracting information automatically from the data without any guidance from
the user.

Ongoing work on the DMtools includes the extension of the toolbox with more
analysis routines and the integration of algorithms like clustering, predictive
modelling and association rules. As these processes are time consuming we are
exploring methods to integrate external parallel applications (optimised C code
using communication libraries like MPI [10]). Building graphical user interfaces
(GUI) on top of our toolbox, Web enabling interfaces and exporting results via
XML are on our wish list, as is the publication of the DMtools as a package under
a free software licence.

Abstract. The standardized visual field assessment, which measures
visual function in 76 locations of the central visual area, is an important
diagnostic tool in the treatment of the eye disease glaucoma. It helps
determine whether the disease is stable or progressing towards blindness,
with important implications for treatment. Automatic techniques to
classify patients based on this assessment have had limited success, primarily
due to the high variability of individual visual field measurements.

The purpose of this paper is to describe the problem of visual field
classification to the data mining community, and assess the success of data
mining techniques on it. Preliminary results show that machine learning
methods rival existing techniques for predicting whether glaucoma is
progressing, though we have not yet been able to demonstrate improvements
that are statistically significant. It is likely that further improvement
is possible, and we encourage others to work on this important
practical data mining problem.



1 Introduction 

Glaucoma, a disease that affects the optic nerve, is one of the leading causes 
of blindness in the developed world. Its prevalence in populations over the age 
of 40 of European extraction is about 1 to 2%; it occurs less frequently in
Chinese populations, and much more often in people descended from West African
nations. There are several different types of glaucoma, but all share the
characteristic that structural damage to the optic nerve tissue eventually leads to loss
of visual function. If left untreated, or treated ineffectively, blindness ultimately 
occurs. 

The cause of glaucoma involves many factors, most of which are poorly under- 
stood. If diagnosed early, however, suitable treatment can usually delay the loss 
of vision. If a patient continues to lose visual function after the initial diagnosis, 
their glaucoma is said to be progressing.

Determining progression is of vital importance to patients for two reasons.
First, if a patient’s vision continues to deteriorate under treatment, the treatment 
must be altered to save their remaining visual function. Second, progression, or 
rate of progression, is used as the main outcome measure of clinical research trials 
involving new glaucoma drugs. If effective medication is to be developed and 
made available for widespread use, accurate detection of progression is essential. 
Patient care is as dependent on monitoring the progression of glaucoma as it is 
on the original diagnosis of the disease. 

Unfortunately, there is no universally accepted definition of glaucomatous 
progression. Various clinical drug trials use different definitions, and criteria 
used by practicing ophthalmologists in vision clinics also differ widely. All agree,
however, that visual field measurements are an essential tool for detecting pro- 
gression. 

The next section explains the standard technique for measuring visual fields. 
Following that, we describe standard methods that have been used to determine 
progression. In Section 4 we introduce a dataset that has been assembled for 
the purpose of testing different progression metrics, and describe the data min- 
ing methods that we have applied. Then we summarize the results obtained, 
using the standard technique of pointwise linear regression as a benchmark.
Surprising results were obtained using a simple 1R classifier, while support vector
machines equaled and often outperformed the benchmark. We also investigated
many other techniques, including both preprocessing the data by smoothing, 
taking gradients, using t-statistics, and physiologically clustering the data; and 
different learning schemes such as decision trees, nearest-neighbor classifiers, 
naive Bayes, boosting, bagging, model trees, locally weighted regression, and 
higher-order support vector machines. Although we were able to obtain little 
further improvement from these techniques, the importance of the problem mer- 
its further study. 



2 Visual Field Measurement 

The standard visual field assessment currently employed in glaucoma manage- 
ment requires the patient to sit facing a half sphere with a white background 
that encompasses their entire field of vision. Typically only the central 30° of 
the visual field is tested. Subjects are instructed to fixate in the center of the 
hemisphere and press a button whenever they see a small white light flash in 
the “bowl.” Lights of varying intensities are flashed in 76 locations of the vi- 
sual field, and the minimum intensity at which the patient can see the target is 
recorded as their threshold. It is not feasible to spend more than about 10 sec- 
onds determining patient thresholds at each location because fatigue influences 
the results, but consistent threshold measurements are possible — particularly in 
reliable patients. Thresholds are scored on a logarithmic decibel scale. At any 
particular location a score of 0 dB indicates that the brightest light could not 
be seen (blindness), while 35 dB indicates exceptional vision at that point.

Fig. 1 shows visual field measurements taken on the right eye of a progressing
glaucoma patient at an interval of five years. Each number is a dB threshold 
value. There are 76 locations in each field, separated by 6° of visual angle. Note 
the blind spot (readings <1 dB) at (15,-3) where the optic nerve exits the 
eye. In a left eye this blind spot occurs at (-15,-3) — assuming the patient 
remains fixated on the center spot for the duration of the test. For convenience 
in data processing, the asymmetry between eyes is removed by negating left eye 
x-coordinates, so that all data is in a right eye format.

Fig. 1. Visual fields for a progressing right eye. (a) Baseline measurement, (b) the same
eye five years later. The axes measure degrees of visual angle from the center fixation 
mark. 



To assist the diagnosis of glaucoma, it is useful to compare the thresholds
recorded in a visual field to a database of normal thresholds. To monitor
progression, visual field measurements must be compared from one visit to the
next, seeking locations whose thresholds have decreased. In Fig. 1 it is obvious
that visual sensitivity around (15,15) and (9,-15) has decreased from Fig. 1(a)
to Fig. 1(b). Visual fields are quite noisy, however, so thresholds from a large
series of visits are usually required to distinguish true progression from
measurement noise. For example, the threshold at location (-9,-27) in Fig. 1
increases from 3 dB to 10 dB, which in this case is probably because the original
estimate of 3 dB is low. The entire field can fluctuate from visit to visit
(“long term variability” in the ophthalmology literature) depending on factors
such as the patient’s mood and alertness, as well as physiological factors like
blood pressure and heart rate. Moreover, each location can vary during the test
procedure (“short term variability”) depending on a patient’s criterion for
pressing the response button, fatigue, learning effects, and mistakes.

The challenge is to detect progression as early as possible, using the smallest 
number of sequential visual fields. 



3 Determining Progression 



Glaucoma patients are usually tested at yearly or half-yearly intervals. When 
presented with visual fields from successive visits, the ophthalmologist’s task is
to decide whether change has occurred, and if so whether it indicates glaucoma 
or is merely measurement noise. This paper casts the decision as a classification 
problem: patients must be classified as stable or progressing based solely on their 
visual field measurements. Several automatic techniques exist to aid the clinician 
in this task; they can be divided into three broad classes. 

The first group bases the classification on “global indices,” which average 
information across all locations of the visual field. The most commonly used 
such measure is mean deviation, which averages the difference between measured 
thresholds and a database of thresholds in non-glaucomatous eyes over the visual 
field. Each location is weighted, based on the variability of normal thresholds. 
The spatial variance of this measure (referred to as pattern standard deviation) 
is also used clinically. A third global technique assigns scores to locations based 
on their threshold values and those of their immediate neighbors, and sums the 
scores into a single measure of the visual field. Studies have shown that
regression on global indices alone is a poor indicator of progression.

Classifiers in the second group treat each location independently of the others.
By far the most common approach is to use pointwise univariate linear
regression (PWLR) to detect trends in individual locations. Typically
this is applied to each individual location, and if the slope of the fitted line is
significantly less than zero the patient is classified as progressing. Several
variations on this theme have been investigated, the most notable being to correct
for natural age-related decline in thresholds. Performing PWLR on each of
the 76 locations introduces a multiple-comparison problem. This has previously
been addressed using multiple t-tests with a Bonferroni correction, ignoring the
fact that locations in the visual field are not strictly independent.

Other pointwise techniques have been investigated, including multivariate
regression and using high-order polynomials to fit threshold trends.
The glaucoma change probability calculates the likelihood that the difference
between a threshold and a baseline measure falls outside the 95% confidence
limits established by a database of stable glaucomatous visual fields.

All these pointwise techniques can be refined by requiring that a cluster of 
points show progression, rather than a single location. Alternatively, progression 
can be confirmed on subsequent tests by requiring that points not only show 
progression after examining n visual fields, but also when n + 1 fields are exam- 
ined. These techniques are currently employed in combination with the glaucoma 
change probability as outcome measures in several drug trials.

The final group of classifiers falls between the two extremes of global indices
and pointwise modeling. These classifiers attempt to take account of the spatial
relationship between neighboring locations. Some approaches cluster the
locations based on known physiological retinal cell maps, and apply regression
to the mean of the clusters. Others use neural networks as classifiers, relying
on the network to learn any spatial relations.

It is difficult to compare different approaches, because most studies do not 
report standard classification metrics. Moreover, patient groups used in experi- 
mental studies differ in size, stage of glaucoma, and type of glaucoma. To provide 
a baseline for comparison, we include results from one of the best-performing 
statistical methods, pointwise univariate regression, which is specifically tailored
to this application. This technique is detailed in Section 4.2 below.

4 Experiments 

Diagnosing glaucoma progression is a prime example of a medical classifica- 
tion task that is ideally suited for data mining approaches because pre-classified 
training data, although arduous to collect, is available. However, apart from
neural networks, no standard data mining paradigms have been applied to this
problem, and there is no evidence that neural networks outperform well-known 
special-purpose statistical algorithms designed for this application. 

We present an empirical evaluation of two standard data mining algorithms,
support vector machines and 1R decision stumps, and show that they
detect progressing glaucoma more accurately than pointwise univariate
regression.

4.1 The Data 

Data was collected retrospectively from patient charts of 113 glaucoma patients 
of the Devers Eye Institute, each having at least 8 visual field measurements 
over at least a 4 year period as part of their regular ophthalmologic examination.
Unlike many previous studies, no special efforts were made to ensure patient 
reliability, nor was the quality of the visual fields evaluated. These are typical 
patient records from a typical clinical situation. 

The patients were classed as progressing, stable, or unknown by an expert
(author CAJ), based on optic disk appearance and their visual field measure- 
ments. The final data set consisted of 64 progressing eyes and 66 stable eyes, 
each with eight visual field measurements at different points in time. The visual 
field threshold values were adjusted for age at measurement by 1 dB per decade, 
so in effect all eyes were from a 45 year old patient. Left eyes were transformed 
into right eye coordinates. 
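
The two preprocessing steps described above (age adjustment and mirroring of left eyes) can be sketched as follows. This is only an illustration under stated assumptions: the function names and the dictionary layout are hypothetical, and the direction of the age correction (adding back the expected loss relative to age 45) is our reading of the text rather than something the paper spells out.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]   # (x, y) in degrees of visual angle from fixation

REFERENCE_AGE = 45.0      # all eyes are normalised to this age (from the text)
DB_PER_DECADE = 1.0       # age-related decline used in the text


def to_right_eye(field: Dict[Point, float], is_left_eye: bool) -> Dict[Point, float]:
    """Mirror a left-eye field into right-eye format by negating x."""
    if not is_left_eye:
        return dict(field)
    return {(-x, y): db for (x, y), db in field.items()}


def adjust_for_age(field: Dict[Point, float], age_years: float) -> Dict[Point, float]:
    """Shift thresholds so that they are comparable to a 45 year old eye.

    Assumption: an older patient gets the expected age-related loss added back.
    """
    correction = DB_PER_DECADE * (age_years - REFERENCE_AGE) / 10.0
    return {loc: db + correction for loc, db in field.items()}


left_field = {(-15, -3): 0.0, (9, 9): 20.0}
print(adjust_for_age(to_right_eye(left_field, is_left_eye=True), age_years=65))
```

After mirroring, the blind spot of a left eye lands at (15,-3), the same position as in a right eye.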

4.2 The Methods 

Pointwise univariate linear regression analysis (PWLR) is a well-established
diagnostic tool in the ophthalmology literature. In an empirical comparison of
several statistical methods for detecting glaucoma progression (pointwise
univariate regression, univariate regression on global indices, pointwise and
cluster-wise multivariate regression, and glaucoma change analysis) it emerged
as the most accurate predictor of clinician diagnosis. It consists of three
steps. First, a linear regression function is fitted to each of the 76 locations in a
visual field, using all available measurements for the patient (8 in the data we
used). Second, a significance test is performed on those locations with a negative
slope to ascertain whether the decrease in sensitivity is statistically significant.
Third, a series of visual field measurements is declared to be progressing if at
least two adjacent test locations are deemed significant, and stable otherwise.
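
The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, "adjacent" is assumed to mean horizontal or vertical neighbours on the 6° grid, and the one-sided critical values of Student's t for 6 degrees of freedom (8 visits) are hard-coded from standard tables to keep the example dependency-free.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[int, int]

# One-sided critical t values for n - 2 = 6 degrees of freedom (standard tables).
T_CRIT_DF6 = {0.05: 1.943, 0.01: 3.143}


def slope_t_statistic(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope of values over times and its t statistic."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    sse = sum((v - (intercept + slope * t)) ** 2 for t, v in zip(times, values))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    if std_err == 0.0:  # perfect fit: the slope is either exactly zero or "infinitely" significant
        return slope, (0.0 if slope == 0.0 else math.copysign(math.inf, slope))
    return slope, slope / std_err


def pwlr_is_progressing(times: Sequence[float], series: Dict[Point, Sequence[float]],
                        alpha: float = 0.01, spacing: int = 6) -> bool:
    """Steps 1-3: fit each location, test negative slopes, require two adjacent hits."""
    if len(times) - 2 != 6:
        raise ValueError("critical values are only tabulated for 8 visits")
    t_crit = T_CRIT_DF6[alpha]
    flagged = set()
    for loc, values in series.items():
        slope, t_stat = slope_t_statistic(times, values)
        if slope < 0 and t_stat < -t_crit:
            flagged.add(loc)
    # Assumed adjacency: neighbours one grid step to the right or above.
    return any((x + spacing, y) in flagged or (x, y + spacing) in flagged
               for (x, y) in flagged)


visits = list(range(8))
declining = [30, 28, 27, 25, 22, 21, 18, 17]
stable = [30, 30, 31, 29, 30, 31, 29, 30]
print(pwlr_is_progressing(visits, {(3, 3): declining, (9, 3): declining, (15, 3): stable}))
```

A single significantly declining location is not enough under this rule; two neighbouring ones are required.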

In our experiments, we used a one-sided t-test to detect progression.
We experimented with different significance levels α and report results for both
α = 0.01 and α = 0.05. We did not adjust for multiple significance tests using
the Bonferroni correction (which would result in much smaller significance
levels) because, like Nouri-Madhavi, we have noticed that it fails to detect
many cases of progressing glaucoma.
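
To see how much smaller the corrected levels would be, the Bonferroni adjustment simply divides the nominal significance level by the number of tests (here the 76 locations). The helper name below is hypothetical:

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


for nominal in (0.05, 0.01):
    print(nominal, bonferroni_alpha(nominal, 76))
```

With 76 locations, a nominal level of 0.05 shrinks to roughly 0.00066 per location, which makes it plausible that few slopes would pass the test with only a handful of visits.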

In contrast to this statistical approach, data mining algorithms exploit train- 
ing data to build a classification model that can diagnose progressing glaucoma. 
In our experiments two very different data mining approaches turned out to
perform well: linear support vector machines and 1R decision stumps. We describe
these next. In Section 4.4 we mention other approaches that failed to improve
upon them. 

Linear support vector machines (LSVM) construct a hyperplane in the input 
space to classify new data. All data on one side of the hyperplane is assigned to 
one class; all that on the other side to the other class. Unlike hyperplanes con- 
structed by other learning algorithms — for example, the perceptron — support 
vector machine hyperplanes have the property that they are maximally distant 
from the convex hulls that surround each class (if the classes are linearly sep- 
arable.) Such a hyperplane is called a “maximum-margin” hyperplane, and is 
defined by those feature vectors from the training data that are closest to it. 
These are called “support vectors.” 

Support vector machines are very resilient to overfitting, even if the feature 
space is large — as it is in this application — because the maximum-margin hyper- 
plane is very stable. Only support vectors influence its orientation and position, 
and there are usually only a few of them. If the data is not linearly separable, for 
example, because noise is present in the domain, learning algorithms for support 
vector machines apply a “soft” instead of a “hard” margin, allowing training in- 
stances to be misclassified by ending up on the “wrong” side of the hyperplane. 
This is achieved by introducing an upper bound on the weight with which each 
support vector can contribute to the position and orientation of the hyperplane. 
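
In the usual textbook formulation of the soft-margin support vector machine, the bound described above appears as a box constraint on the dual variables. The notation below (multipliers \alpha_i, labels y_i \in \{-1,+1\}, bound C) is the standard one and is an assumption here, not notation taken from the paper:

```latex
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, x_i^{\top} x_j
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0,
\qquad \text{with} \qquad w = \sum_{i=1}^{n} \alpha_i y_i x_i .
```

Training instances with \alpha_i > 0 are the support vectors. A small C limits how far any single noisy instance can move the hyperplane, at the cost of allowing more training errors.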

The glaucoma data is very noisy: it is often hard even for experts to agree 
on the classification of a particular patient. Thus our experiments impose a low 
upper bound on the support vectors’ weights, namely 0.05. To learn the support 
vector machines we employed the sequential minimal optimization algorithm
with published modifications. An implementation of this algorithm is
included in the WEKA machine learning workbench.

The most straightforward way to apply support vector machines to this prob- 
lem is to use each individual visual field measurement as an input feature. With 
8 visual field measurements and 76 locations for each one, this produces 608 
features. However, proceeding this way discards valuable information — namely 
the fact that there is a one-to-one correspondence between the 76 locations for 
different visual field measurements. A more promising approach is to use the per- 
location degradation in sensitivity as an input feature, yielding 76 features in 
total. This produced consistently more accurate results in our experiments. We 
experimented with three different ways of measuring degradation: (a) using the 
slope of a univariate regression function constructed from the 8 measurements 
corresponding to a particular location, (b) using the value of the t-statistic for 
that slope, and (c) simply taking the difference between the first and the last 
measurement. Surprisingly, methods (a) and (b) did not result in more accurate 
classifications than (c). Thus the experimental results presented in Section 4.3
are based on method (c). 

A slightly more sophisticated approach is to sort the per-location differences 
into ascending order before applying the learning scheme. This is motivated by 
the fact that different patients exhibit progressing glaucoma in different loca- 
tions of the visual field. A disadvantage is that it prevents the learning scheme 
from exploiting correlations between neighboring pixels. With this approach, the 
learning scheme’s first input feature is the location exhibiting the smallest de- 
crease in sensitivity, and its 76th feature is the location exhibiting the largest 
decrease. The median is represented by the 38th feature. In our experiments, the 
sorted differences produced more accurate predictions. All the results presented 
in Section 4.3 are based on this approach.
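
The sorted-difference representation is simple enough to state directly in code. In this sketch (the function name and input layout are hypothetical) each inner list holds the thresholds of one location from the first to the last visit; a real record would have 76 such lists.

```python
from typing import List, Sequence


def sorted_differences(series: Sequence[Sequence[float]]) -> List[float]:
    """First-minus-last threshold per location, sorted into ascending order.

    The first feature is the smallest decrease in sensitivity (possibly an
    increase), the last feature is the largest decrease.
    """
    decreases = [values[0] - values[-1] for values in series]
    return sorted(decreases)


# Three locations only, for illustration.
example = [[30, 28, 20], [25, 27, 26], [10, 11, 10]]
print(sorted_differences(example))  # [-1, 0, 10]
```

Sorting discards which location produced which difference, which is exactly the trade-off noted above: the features become comparable across patients, but spatial relationships are lost.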

Compared to a support vector machine, a 1R decision stump is a very
elementary classification scheme that simply determines the single most predictive
feature and uses it to classify unknown feature vectors. Numeric features are
discretized into intervals before applying the scheme. If the value of the chosen
feature of a test instance falls into a particular interval, that instance is assigned
the majority class of the training examples in this interval. The discretization
intervals for a particular feature are constructed by sorting the training data
according to the value for that feature and merging adjacent feature vectors of
the same class into one interval. To prevent overfitting, one additional constraint
is employed: the majority class in a particular interval must be represented by
a certain minimum number of feature vectors in the training data. Holte
recommends a minimum of 6 as the threshold; this is what we used in our
experiments.
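
The single-feature scheme just described can be sketched as follows. This is a simplified reading of the procedure rather than a faithful reimplementation of Holte's algorithm or of the WEKA version: the function names are hypothetical, ties are broken arbitrarily, and a trailing interval that is too small is merged into its predecessor.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Rule = List[Tuple[float, str]]  # (upper bound, predicted class), ascending bounds


def build_intervals(values: Sequence[float], labels: Sequence[str], min_bucket: int = 6) -> Rule:
    """Discretize one numeric feature into intervals with a majority class each."""
    pairs = sorted(zip(values, labels))
    rule: Rule = []
    counts: Counter = Counter()
    for i, (value, label) in enumerate(pairs):
        counts[label] += 1
        majority, size = counts.most_common(1)[0]
        if i == len(pairs) - 1:
            if size < min_bucket and rule:
                rule[-1] = (math.inf, rule[-1][1])  # too small: extend previous interval
            else:
                rule.append((math.inf, majority))
        elif (size >= min_bucket and pairs[i + 1][0] != value
              and pairs[i + 1][1] != majority):
            # Close the interval halfway to the next distinct value.
            rule.append(((value + pairs[i + 1][0]) / 2, majority))
            counts = Counter()
    merged: Rule = []
    for bound, cls in rule:  # merge neighbours that predict the same class
        if merged and merged[-1][1] == cls:
            merged[-1] = (bound, cls)
        else:
            merged.append((bound, cls))
    return merged


def predict(rule: Rule, value: float) -> str:
    """Return the class of the first interval whose upper bound covers value."""
    for bound, cls in rule:
        if value <= bound:
            return cls
    return rule[-1][1]


def fit_one_r(rows: Sequence[Sequence[float]], labels: Sequence[str],
              min_bucket: int = 6) -> Tuple[int, Rule]:
    """Pick the feature whose interval rule makes the fewest training errors."""
    best = None
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rule = build_intervals(column, labels, min_bucket)
        errors = sum(predict(rule, v) != c for v, c in zip(column, labels))
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    return best[1], best[2]


rows = [[k % 2, k] for k in range(6)] + [[k % 2, 10 + k] for k in range(6)]
labels = ["stable"] * 6 + ["progressing"] * 6
print(fit_one_r(rows, labels))
```

In this toy data the first feature carries no class information and the second separates the classes, so the stump selects the second feature with a single cut point.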

As with support vector machines, we used the sorted differences in sensitivity
between the first and last measurement for each location as a set of 76 input
features for 1R. Applying this single-attribute scheme on the unsorted differences
is not appropriate: classifications would be based on the single test location in
the series of visual field measurements that is the most predictive one across all
patients in the training data. It is unlikely that the same location is affected
by progressing glaucoma in every patient. As we show in the next section, 1R
tends to choose a feature for prediction that is close to the median per-location
difference.



4.3 The Results 

Tables 1 and 2 summarize the experimental results that we obtained by
applying the three different techniques to our dataset consisting of 130 visual field
sequences. For PWLR we show results for two different significance levels α,
namely 0.05 and 0.01. All performance statistics are estimated using stratified
10-fold cross-validation. The same folds are used for each method. Standard
deviations for the 10 folds are also shown. Note that PWLR does not involve
any training. Thus cross-validation is not strictly necessary to estimate its
performance. However, it enables us to compare PWLR to the other methods and
lets us test potential performance differences for statistical significance.
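
For readers unfamiliar with stratification, the sketch below shows one simple way to build folds that preserve the class proportions. It is not the WEKA procedure used in the experiments; the function name, the round-robin dealing and the seed are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[str], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Deal shuffled instance indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    position = 0  # keeps running across classes so fold sizes stay balanced
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for index in members:
            folds[position % k].append(index)
            position += 1
    return folds


labels = ["progressing"] * 64 + ["stable"] * 66
folds = stratified_folds(labels)
print([len(fold) for fold in folds])
```

With 64 progressing and 66 stable eyes, every fold ends up with 13 sequences, 6 or 7 from each class. Reusing the same fold assignment for every method is what makes paired significance tests between methods possible.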

Table 1 shows the estimated percentage of correct classifications. Note that
50.8% is the accuracy achieved by a classifier that always predicts the majority 
class (i.e. “stable glaucoma”). The five different columns correspond to different 
numbers of visual fields: for the column labeled “8 VFs” we used all 8 visual field 
measurements corresponding to a particular patient to derive a classification 
(and for training if applicable), for the column labeled “7 VFs” we used the first 
7, etc. The classification problem gets harder as fewer visual field measurements 
become available, and corresponds to an earlier diagnosis. 

Table 2 shows estimated performance according to the kappa statistic.
This statistic measures how much a classifier improves on a chance classifier
that assigns class labels randomly in the same proportions. It is defined by:

\[
\kappa = \frac{P_c - P_r}{1 - P_r} \tag{1}
\]

where P_c is the proportion of correct classifications made by the classifier under
investigation, and P_r is the corresponding expected value for a chance classifier.
Following the convention established by Landis and Koch, values of kappa
above 0.8 represent excellent agreement, values between 0.4 and 0.8 indicate
moderate-to-good agreement, and values less than 0.4 represent poor agreement.
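
As a worked illustration of Equation (1), the sketch below computes kappa from a confusion matrix. The matrix is hypothetical (it is not taken from the experiments), and the chance agreement is computed in the usual way for Cohen's kappa, from the products of the actual and predicted class proportions.

```python
from typing import Sequence


def kappa(confusion: Sequence[Sequence[int]]) -> float:
    """Kappa statistic; confusion[i][j] counts true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_c = sum(confusion[i][i] for i in range(n_classes)) / total
    p_r = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)  # actual * predicted
        for i in range(n_classes)
    ) / total ** 2
    return (p_c - p_r) / (1 - p_r)


# Hypothetical outcome for 64 progressing and 66 stable eyes: 80% correct.
print(round(kappa([[50, 14], [12, 54]]), 3))
# A classifier that always predicts "stable" scores no better than chance.
print(kappa([[0, 64], [0, 66]]))
```

The second call shows why kappa is reported alongside accuracy: always predicting the majority class is 50.8% accurate on this data, yet its kappa is zero.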

The first observation is that the accuracy of the different methods depends on 
the number of visual field measurements that are available. With the exception 
of PWLR given a setting of α = 0.05 (for reasons explained below), all methods
achieve kappa values greater than 0.4 for 7 and 8 VFs, and exhibit a decline in 
performance as fewer VFs become available. For 6 and fewer VFs all estimates 
of kappa are below 0.4. 

The second observation is that the performance of PWLR depends on an
appropriately chosen significance level. According to a paired two-sided t-test
(significant at the 0.05%-level), PWLR with α = 0.01 performs significantly
better than PWLR with α = 0.05 for 8 and 7 VFs (according to both percent
correct and kappa); for 6 and 5 VFs there is no significant difference between
the two parameter settings; and for 4 VFs α = 0.05 significantly outperforms
α = 0.01 according to the percent correct measure.

Table 1. Percent correct, and standard deviation, for different numbers of visual field
measurements (estimated using stratified 10-fold cross-validation).

Method          8 VFs      7 VFs       6 VFs       5 VFs       4 VFs
PWLR (α=0.01)   75.4±8.7   72.3±10.4   62.3±9.2    60.8±9.2    53.9±6.3
PWLR (α=0.05)   61.5±8.1   60.8±8.5    61.5±8.9    60.8±8.5    63.1±8.7
1R              80.0±9.0   78.5±12.5   64.6±13.2   61.5±11.5   56.2±10.3
LSVM            75.4±9.5   72.3±8.3    68.5±6.7    61.5±8.9    63.9±12.1

Table 2. Value of kappa statistic, and standard deviation, for different numbers of
visual field measurements (estimated using stratified 10-fold cross-validation).

Method          8 VFs       7 VFs       6 VFs       5 VFs       4 VFs
PWLR (α=0.01)   0.51±0.18   0.45±0.2    0.24±0.18   0.2±0.17    0.06±0.13
PWLR (α=0.05)   0.24±0.16   0.23±0.15   0.23±0.19   0.23±0.15   0.25±0.19
1R              0.59±0.19   0.56±0.25   0.28±0.27   0.23±0.22   0.12±0.2
LSVM            0.50±0.19   0.44±0.17   0.36±0.14   0.23±0.18   0.28±0.24

The reason for this result is that α = 0.05 is too liberal a significance level
if 7 or 8 VFs are available: it detects too many decreasing slopes in the series of
visual field measurements, consequently classifying too many glaucoma patients
as progressing. On the other hand, α = 0.01 is too conservative if only a few VF
measurements are present: with 4 VFs it almost never succeeds in diagnosing
progressing glaucoma.

The third observation is that LSVM does not share this disadvantage of
PWLR. It does not require parameter adjustment to cope with different numbers
of visual field measurements. For 7 and 8 VFs it performs as well as PWLR with
α = 0.01 and significantly better than PWLR with α = 0.05; for 6 VFs it
performs better than both; and for 4 VFs it performs as well as PWLR with
α = 0.05 and significantly better than PWLR with α = 0.01.

The fourth observation is that the 1R-based method is the best-performing
one for 8 and 7 VFs. For 6 or fewer VFs it appears to be less accurate than
LSVM. However, the only differences that are statistically significant occur for 7
and 8 VFs between 1R and PWLR using α = 0.05. Interestingly, if 7 or 8 VFs are
available, 1R consistently bases its predictions on the 33rd-largest per-location
difference. Forcing the scheme to use the median per-location difference (the
38th-largest difference) slightly decreases performance. Unfortunately, due to
lack of additional independently sampled data, we could not test whether this is
a genuine feature of the domain or just an artifact of the particular dataset we
used.

4.4 Other Approaches

During the course of our experiments we tried many other learning schemes and 
data pre-processing regimes: 

• Replacing the above classifiers with decision trees, nearest-neighbor rules,
naive Bayes, model trees, and locally weighted regression. We also tried
higher-order support vector machines, which are able to represent non-linear
class boundaries.

• Performance enhancing wrappers applied to the above classifiers such as
boosting, bagging and additive regression.

• A custom stacking approach where several classifiers are built using the
difference between the i-th and the (i-1)-th visual field and a meta-classifier
to arbitrate between the predictions of these base classifiers.

• Smoothing the measurements for each location using a rectangular filter
before applying the learning scheme (with varying filter sizes).

• Using physiologically clustered data where the 76 test locations are reduced
into several clusters based on physiological criteria.

• Using simulated data in addition to data based on real patients.

• Focusing on the N locations with the largest decrease in sensitivity and using 
only those points, along with some of the surrounding ones, as the input to 
the learning scheme. 

• Using the number of per-location-decreases as feature value instead of the 
difference between the first and the last measurement. 

• Adding the difference between the mean of the per-location measurements 
for the first and the last visual field as an extra attribute. 

None were able to improve on the results reported above. 
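
As an example of the preprocessing variants listed above, smoothing with a rectangular filter amounts to a moving average over the sequence of visits at one location. The sketch below is an illustration only: the function name is hypothetical, and shrinking the window at the two ends of the series is one possible edge treatment that the paper does not specify.

```python
from typing import List, Sequence


def rectangular_smooth(values: Sequence[float], width: int = 3) -> List[float]:
    """Moving average with a rectangular (boxcar) window of odd width."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]  # truncated at the edges
        smoothed.append(sum(window) / len(window))
    return smoothed


print(rectangular_smooth([10, 20, 30, 40], width=3))  # [15.0, 20.0, 30.0, 35.0]
```

Wider filters suppress more of the visit-to-visit noise but also flatten genuine declines, which may be one reason smoothing did not help here.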

5 Discussion 

Determining whether a glaucoma patient has deteriorating vision on the basis of 
visual field information is a challenging, but important, task. Each visual field 
assessment provides a large number of attributes, and the task lends itself to 
data mining approaches. We hope that exposing the problem to the data mining 
community will stimulate new approaches that reduce the time required to detect 
progression. 

The results of our experiments show that both 1R and support vector
machines appear to improve on pointwise univariate regression, a method that is
commonly used for glaucoma analysis. Among all methods tested, 1R produced
the best results if a series of eight or seven visual field assessments is available for
each patient. If six or fewer measurements are available, a linear support vector 
machine appears to be the better choice, and we found it to be the most reli- 
able method across different numbers of measurements. Compared to pointwise 
univariate regression analysis, whose performance depends strongly on an
appropriately chosen significance level, this stability is an outstanding advantage.

We have focused on white-on-white visual field data measured in 76
locations. However, many other attributes are available that assist in progression
analysis, and these may help learning schemes to yield more accurate results.
For example, there are other visual field tests that detect glaucoma earlier and
signal progression more rapidly, such as blue-on-yellow perimetry and Frequency
Doubling Technology perimetry. The latter is particularly promising because it
appears less variable than white-on-white perimetry.

Supplementary data is often available from measures of the structure of the 
optic nerve head. For example, confocal scanning laser tomography obtains 32 
optical sections of the optic nerve with a laser and uses them to generate a
3-D topographic map. This allows changes in nerve head shape, which may
signal progression of glaucoma, to be quantified. However, these measures are 
very noisy, and it remains to be seen whether they can increase accuracy. 

A serious impediment to research is the effort required to gather clean data 
sets. Not only is it arduous to collect clinical data, but significant expert time 
is needed to classify each patient. Computer simulation of visual field progres- 
sion, which generates data that closely models reality, may offer an alternative 
source of training data. An added advantage is that simulated visual fields
are known to be progressing or stable by design, so the classification operation 
introduces no noise into the data.

Abstract. In this paper we examine the use of Bayesian networks (BNs)
for improving weather prediction, applying them to the problem of
predicting sea breezes. We compare a pre-existing Bureau of Meteorology
rule-based system with an elicited BN and others learned by two data
mining programs, TETRAD II [Spirtes et al., 1993] and Causal MML
[Wallace and Korb, 1999]. These Bayesian nets are shown to significantly
outperform the rule-based system in predictive accuracy.



1 Introduction 



Bayesian networks have rapidly become one of the leading technologies for ap- 
plying artificial intelligence to real-world problems, as a decision support tool 
for reasoning under uncertainty. The usual approach is one of knowledge en- 
gineering: elicit the causal structure and conditional probabilities from domain 
experts. This approach, however, suffers from the same headaches that
accompanied early expert systems; frequently, for example, experts are unavailable or
they generate inconsistent probabilities [Wallsten and Zwick, 199?]. This has led
to a recent upsurge in interest in the automated learning of Bayesian networks.
Here we examine the use of Bayesian networks (BNs) in improving weather 
prediction. In particular, we compare a pre-existing rule-based system for the 
prediction of sea breezes provided by the Australian Bureau of Meteorology
(BOM) with BNs developed by expert elicitation and BNs learned by two
machine learning programs, TETRAD II [Spirtes et al., 1993] and Causal
MML (CaMML; [Wallace and Korb, 1999]). We first describe the domain, BN
methodology and data mining methods and then examine predictive accuracy. 



2 The Seabreeze Prediction Problem 

Sea breezes occur because of the unequal heating and cooling of neighbouring 
sea and land areas. As warmed air rises over the land, cool air is drawn in from 
the sea. The ascending air returns seaward in the upper current, building a 
cycle and spreading the effect over a large area. If wind currents are weak, a sea 
breeze will usually commence soon after the temperature of the land exceeds that 
of the sea, peaking in mid-afternoon. A moderate to strong prevailing offshore 
wind will delay or prevent a sea breeze from developing, while a light to moderate 
prevailing offshore wind at 900 metres (known as the gradient level) will reinforce 
a developing sea breeze. The sea breeze process is also affected by time of day, 
prevailing weather, seasonal changes and geography [Batt, 1995; Bethwaite, 19??; 
Houghton, 1992]. 

BOM provided a database of meteorological information from three types of 
sensor sites in the Sydney area. We used 30MB of data from October 1997 to Oc- 
tober 1999, with about 7% of cases having missing attribute values. Automatic 
weather stations (AWS) provided ground level wind speed (ws) and direction 
(wd) readings at 30 minute intervals (date and time stamped). Olympic sites 
provided ground level wind speed (ws), direction (wd), gust strength, tempera- 
ture, dew temperature and rainfall. Weather balloon data from Sydney airport 
(collected at 5am and 11pm daily) provided vertical readings for gradient-level 
wind speed (gws), direction (gwd), temperature and rainfall. (Predicted variables 
are wind speed [wsp] and wind direction [wdp] below.) 

Seabreeze forecasting is currently done using a simple rule-based system, 
which predicts them by applying several conditions: if the wind is offshore and 
is less than 23 knots, and if part of the forecast timeslice falls in the afternoon, 
then a sea breeze is likely to occur. Its predictions are generated from wind 
forecasts produced from large-scale weather models. According to BOM, this 
rule-based system is the best they have been able to produce and correctly 
predicts approximately two thirds of the time. 
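
The rule described above can be sketched in a few lines. This is only an illustrative reading of the three conditions stated in the text, not the BOM implementation: the function name, the boolean `wind_is_offshore` input, and the 12:00 to 18:00 definition of "afternoon" are all assumptions made for the example.

```python
from datetime import time


def sea_breeze_likely(wind_is_offshore, wind_speed_knots, slice_start, slice_end,
                      afternoon_start=time(12, 0), afternoon_end=time(18, 0)):
    """Hypothetical sketch of the rule-based predictor.

    wind_is_offshore : bool, forecast wind direction is offshore
    wind_speed_knots : forecast wind speed in knots
    slice_start/end  : datetime.time bounds of the forecast timeslice
    The afternoon bounds are assumed values, not taken from the paper.
    """
    light_enough = wind_speed_knots < 23          # "less than 23 knots"
    # The timeslice overlaps the afternoon if the two intervals intersect.
    overlaps_afternoon = slice_start < afternoon_end and slice_end > afternoon_start
    return wind_is_offshore and light_enough and overlaps_afternoon
```

For example, `sea_breeze_likely(True, 10, time(13, 0), time(16, 0))` satisfies all three conditions, while raising the wind speed to 25 knots or moving the timeslice to the early morning makes the rule fail.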



3 Bayesian Network Methodology 

Bayesian methods provide a formalism for reasoning under conditions of uncer- 
tainty. A Bayesian network is a directed acyclic graph representing a probability 
distribution [Pearl, 1988]. Network nodes represent random variables and arcs 
represent the direct dependencies between variables. Each node has a conditional 
probability table (CPT) which indicates the probability of each possible state of 
the node given each combination of parent node states. The tables of root nodes 
contain unconditional prior probabilities. 

A major benefit of BNs is that they allow a probability distribution to be 
decomposed into a set of local distributions. The network topology indicates how 
these local distributions should be combined to produce the joint distribution 
over all nodes in the network. This allows the separation of the quantification of 
influence strengths from the qualitative representation of the causal influences 
between variables, making the knowledge engineering and/or interpretation of 
BNs significantly easier. The task of building a Bayesian model can therefore be 
split in two: the specification of the structure of the domain, and the quantification 
of the causal influences. Various tools for efficient inference in BNs have been 
developed; we used Netica [Norsys, 2000]. In the remainder of this section we 
describe the tasks involved in applying BNs to the Seabreeze prediction problem. 
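
The decomposition into local distributions mentioned above is the standard chain-rule factorization of a Bayesian network. Written out, with $\mathrm{pa}(X_i)$ denoting the parents of node $X_i$ in the graph (notation introduced here for illustration only):

```latex
P(X_1, \ldots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)
```

Each factor corresponds to one CPT; for a root node the parent set is empty and the factor is the unconditional prior.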

Netica learns BN CPTs by counting combinations of variable occurrences, a 
method developed by [Spiegelhalter and Lauritzen, 1990]. We applied this technique 
to all networks, but only after the qualitative causal structure was fixed, 
either by expert elicitation or the causal discovery methods described below. 
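
As a rough illustration of learning a CPT by counting, the sketch below tallies how often each child state occurs with each combination of parent states and normalizes the counts. It is not Netica's code: the function name, the dictionary-based case format, and the symmetric pseudo-count `prior_count` (added so that unseen states do not get probability zero) are assumptions of this example.

```python
from collections import Counter, defaultdict


def learn_cpt(cases, child, parents, child_states, prior_count=1.0):
    """Estimate P(child | parents) from a list of dict-valued cases.

    Returns {parent_state_tuple: {child_state: probability}}.
    prior_count is an assumed symmetric pseudo-count per child state.
    """
    counts = defaultdict(Counter)
    for case in cases:
        key = tuple(case[p] for p in parents)     # one parent configuration
        counts[key][case[child]] += 1
    cpt = {}
    for key, c in counts.items():
        total = sum(c.values()) + prior_count * len(child_states)
        cpt[key] = {s: (c[s] + prior_count) / total for s in child_states}
    return cpt
```

Only parent configurations that actually occur in the data receive a row here; a fuller treatment would also emit the prior for unseen configurations.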

The first method we used for network construction was expert elicitation, 
with meteorologists at the BOM (Figure 1). The links between network nodes 
described causal relationships between the wind to be predicted and the current 
wind, the time of day, and the month of the year. Arc direction was selected to 
reflect the temporal relationship between variables. 




Fig. 1. Airport data networks - a) CaMML, b) TETRAD II with prior temporal or- 
dering, c) Expert elicitation 



Learning causal structure by testing data for conditional independencies (CI 
learning) was introduced by [Verma and Pearl, 1991]. The basic algorithm presumes 
the existence of an ‘oracle’ which can provide a true or false answer to 
any question of type $X \perp\!\!\!\perp Y \mid S$ (i.e., X and Y are independent given the variables 
in set S). The algorithm is not guaranteed to discover the original Bayesian 
network, but will find all direct arcs between nodes and orient many of them. 
TETRAD II [Spirtes et al., 1993] provided the first practical implementation of 
this algorithm, replacing the oracle with significance tests for vanishing conditional 
dependencies. TETRAD II asymptotically obtains the causal structure 
of a distribution to within the statistical equivalence class of the true model. 
Unfortunately, as a CI learner, TETRAD II does not always specify the direction 
of a link between nodes. To compensate for this, it is possible to specify a 
partial temporal ordering of variables, should the user have such prior information. 
Therefore, we generated two networks with TETRAD II for each data set: 
the first simply from the data, and the second using a full temporal ordering of 
variables (Figure 1). The first run was performed to allow a fair comparison of 
TETRAD II's performance with CaMML, given no extra domain information. 

An alternative type of causal learning employs a scoring metric to rank each 
network model and searches through the model space, attempting to maximize 
its metric. MML (Minimum Message Length) [Wallace and Boulton, 1968] uses 
information theory to develop a Bayesian metric. MML metrics for causal models 
have been developed by [Wallace et al., 1996; Wallace and Korb, 1999]. Here, we 
applied CaMML (Causal MML; [Wallace and Korb, 1999]), which conducts the 
MML search through the space of causal models using stochastic sampling. 

4 Experimental Results 

Here we consider the performance of the different tools in sea breeze prediction. 
First we compare the BNs with the existing predictor provided by BOM, and 
then we compare the different BNs against each other. 
All of the BNs were trained on weather data provided by BOM. The elicited 
BN was parameterized by Netica using those data; the TETRAD II and MML 
BN structures were learned from the data and then parameterized by Netica. We 
examined four different testing regimes: using 1997 data for training and 1998 
data for testing; randomly selecting 80% of cases from one year and using the 
remainder for testing; the same 80-20 split, but using data from both years; and 
incremental training and testing, i.e., training from all data prior to the date of 
the prediction, and compiling prediction accuracy results over a full year. Most of 
the results below concern predictive accuracy determined by the third method. 
In these cases the random selection of training and test data was performed 15 
times and confidence intervals computed in order to check statistical significance; 
in general, differences in accuracy > 10% are significant at the 0.05 level. 
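
The repeated random 80-20 evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: `train_and_score` is a hypothetical callback that trains on the first argument and returns accuracy on the second, and the interval uses a normal approximation (factor 1.96), which is somewhat optimistic for only 15 repetitions.

```python
import random
import statistics


def repeated_split_accuracy(cases, train_and_score, repeats=15,
                            train_fraction=0.8, seed=0):
    """Mean accuracy and an approximate 95% confidence interval
    over repeated random train/test splits of a list of cases."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = cases[:]                  # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    mean = statistics.mean(scores)
    # Normal-approximation half width; an assumption of this sketch.
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - half_width, mean + half_width)
```

Two models can then be compared by checking whether their intervals overlap, which is a conservative stand-in for a formal significance test.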

The BOM estimated that the predictive accuracy of the rule-based system 
(RB) was approximately 60 to 70 per cent (more detailed statistics were unavail- 
able). A sea breeze is defined as occurring when and only when the gradient wind 
direction is offshore and the ground level wind onshore. Weather balloons pro- 
viding usable data were launched from Sydney airport twice a day, at about 5am 
and 11pm. Predictions were produced over a period of two years, from both sets 
of data, with the lookahead time of the prediction in increments of three hours. 
Predictions of both seabreeze existence and (more interesting and difficult) wind 
direction were generated for each AWS site (see [Kennett et al., 2001] for full 
results). The predictive accuracy of the system varied in a rough sine pattern, 
with a maximum of 0.8 reached at about 4pm. Wind direction accuracy was 
approximately 10% lower, in the same pattern. 

We tested four Bayesian networks for predictive accuracy (see Figure 1), one 
discovered by MML, two by TETRAD II and one elicited from experts. Since 
TETRAD II without the aid of a prior temporal ordering produced undirected 
arcs and cycles, its network was modified by resolving inconsistencies and ambiguities 
to TETRAD II's advantage. The accuracy results for these four BNs, 
together with the BOM RB system, are given in Figure 2. It is clear that, with 
the exception of the earliest 3 hour prediction, the differences between the BNs 
are not statistically significant, while the simple BOM RB system is clearly 
(statistically significantly) underperforming all of the Bayesian networks. This 
demonstrates that the automated data mining methods are capable of improving 
on the causal relations encoded in the rule-based system. In this problem, the 
difference between CaMML and TETRAD II was largely in usability: TETRAD 
II either required additional prior information (temporal constraints) or else 
hand-made posterior edits to reach the level of performance of CaMML. 

In general, network performance was maximal (up to 80% accuracy) at lookahead 
times which were multiples of 12 hours, corresponding to late afternoon or 
early morning. Clearly, there is a strong periodicity to this prediction problem. 
In the future, such periodicity could be explicitly incorporated into models using 
MML techniques (e.g., in selecting parameters for a sine function). 

The training method examined thus far has the drawback that both the 
structure of the model and its parameters are learned in batch mode, with pre- 
dictions generated from a fixed, fully specified model. We speculated that since 
weather systems change over time a better approach would be to learn the causal 
Fig. 2. Comparison of Airport site type network versions 

structure in batch mode, but to reparameterize the Bayesian network incremen- 
tally, applying a time decay factor so as to favor more recent over old data. The 
(unnormalized) weight applied to data for incremental updating of the network 
parameters was optimized by a greedy search. Figure 0 shows the average per- 
formance of the MML Bayesian networks when incrementally reparameterized 
over the 1998 data. The improvement in predictive accuracy is statistically sig- 
nificant at the 0.05 level, despite the fact that the average scores reported here 
are themselves presumably suboptimal, since the predictions made early in the 
year use parameters estimated from small data sets. 
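
One simple way to realize incremental reparameterization with a time decay is to shrink all existing counts by a constant factor before adding each new observation. The sketch below assumes this exponential form; the paper does not give the exact weighting scheme, and the function names and the `decay` value are hypothetical. In the study the weight was tuned by a greedy search, which is not shown here.

```python
def decayed_update(counts, observed_state, decay=0.99):
    """Down-weight old evidence by `decay`, then add one new observation.

    counts maps each state of a node (for one parent configuration)
    to its accumulated, unnormalized weight.
    """
    updated = {state: decay * c for state, c in counts.items()}
    updated[observed_state] = updated.get(observed_state, 0.0) + 1.0
    return updated


def to_probabilities(counts):
    """Normalize the weights into a probability distribution."""
    total = sum(counts.values())
    return {state: c / total for state, c in counts.items()}
```

With `decay=1.0` this reduces to plain counting; smaller values make the parameters track recent weather more closely at the cost of noisier estimates.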
5 Conclusion 

This case study provides a useful model for employing Bayesian network technol- 
ogy in data mining. The initial rule-based predictive system was demonstrated to 
be inferior to all of the Bayesian networks developed in this study. The Bayesian 
net elicited from the domain experts performed on a par with those generated au- 
tomatically by data mining. Nevertheless, the data mining methods show them- 
selves to be very promising, since they performed as well as the elicited network 
and so offer a good alternative when human expertise is unavailable. Further- 
more, the adaptive parameterization outperformed the static Bayesian networks 
and provides a possible model for combining elicitation with automated learning. 
In the future, we hope to apply these Bayesian net modeling techniques to more 
challenging meteorological problems, such as severe weather prediction. 
Abstract. This paper presents a novel graph-based algorithm for solving 
the semi-supervised learning problem. The graph-based algorithm 
makes use of recent advances in stochastic graph sampling techniques 
and a modeling of the labeling consistency in semi-supervised learning. 
The quality of the algorithm is empirically evaluated on a synthetic clustering 
problem. The semi-supervised clustering is also applied to the 
problem of symptom classification in a medical image database and shows 
promising results. 



1 Introduction 

In machine learning for classification problems, there are two distinct approaches 
to learning or classifying data: supervised learning and unsupervised learning. 
Supervised learning deals with problems where one set of data is labeled 
for training and another set of data is used for testing. Unsupervised 
learning deals with problems where none of the labels of the data are available. In 
recent years, important classification tasks have emerged with enormous volumes 
of data. Labeling a significant portion of the data for training is either 
infeasible or impossible. Sufficient labeled data for training are often unavailable 
in data mining, text categorization and web page classification. A number 
of approaches have been proposed to combine a set of labeled data with unlabeled 
data for improving the classification rate. The naive Bayes classifier and 
the EM algorithm have been combined for classifying text using labeled and 
unlabeled data [?]. Support vector machines have been extended with transductive 
inference to classify text [?]. A modified support vector machine and 
non-convex quadratic optimization approaches have been studied for optimizing 
semi-supervised learning [?]. Graph-based clustering has received a lot of attention 
recently. A factorization approach has been proposed for clustering [?], and 
normalized cuts have been proposed as a generalized method for clustering [?] 
and [?]. In this paper, we investigate the use of a stochastic graph-based sampling 
approach for solving the semi-supervised clustering problem. Graph-based clustering is also 
shown to be closely related to similarity-based clustering [7]. 
2 Graph Based Clustering 



The cluster data is modeled using an undirected weighted graph. The data to 
be clustered are represented by vertices in the graph. The weight of the edge 
between two vertices represents the similarity between the objects indexed by those vertices. 
The similarities are collected in a similarity matrix S, where $S_{ij}$ represents the similarity between the 
object $O_i$ and the object $O_j$. A popular choice of the similarity matrix is of the 
form 

$$S_{ij} = \exp\left(-\frac{d(i,j)}{\sigma^{2}}\right) \qquad (1)$$

where $d(i,j)$ is a distance measure between object i and object j and $\sigma$ is a scale parameter. The exponential 
similarity function has been used for clustering in [?], [?] and [?]. The 
graph G is simply formed by using the objects in the data as nodes and using the 
similarity $S_{ij}$ as the value (weight) of the edge between nodes i and j. The 
minimum cut algorithm partitions the graph G into two disjoint sets (A, B) 
that minimize the following objective function, 



$$\mathrm{cut}(A, B) = \sum_{i \in A,\; j \in B} S_{ij} \qquad (2)$$



The minimum cut problem has been widely studied. The classical 
approach to finding the minimum cut is to solve the complementary problem 
of maximum flow. Recently, Karger introduced a randomized algorithm which 
solves the minimum cut problem in $O(n^2 \log^3 n)$ time [?]. The randomized algorithm 
makes use of a contraction algorithm for evaluating the minimum cuts of the 
graph, and we show in a later section how we modify the contraction algorithm 
for semi-supervised learning. 



3 Graph Contraction Algorithm 

The contraction algorithm for unsupervised clustering consists of an iterative 
procedure that contracts two connected vertices i and j: 

— Randomly select an edge (i, j) from G with probability proportional to Sij 

— Contract the edge (i, j) into a meta node ij 

— Connect all edges incident on i and j to the meta node ij while removing 
the edge (i, j) 

This contraction is repeated until all nodes are contracted into a specified number 
of clusters. The semi-supervised algorithm assumes a set of given labeled 
samples. Initially, assign the empty label to all nodes, i.e., $L_i = \emptyset$ for all nodes i. 
Then set the labels of the given labeled nodes to their respective labels. After 
this initialization, the following contraction algorithm is applied: 

1. Randomly select an edge (i, j) from G with probability proportional to $S_{ij}$; 
depending on the labels of i and j, do one of the following: 
a) If $L_i = \emptyset$ and $L_j = \emptyset$, then $L_{ij} = \emptyset$ 

b) If $L_i = \emptyset$ and $L_j \neq \emptyset$, then assign $L_{ij} = L_j$ 

c) If $L_j = \emptyset$ and $L_i \neq \emptyset$, then assign $L_{ij} = L_i$ 

d) If $L_i \neq L_j$ and $L_i \neq \emptyset$ and $L_j \neq \emptyset$, then remove the edge (i, j) from G 
and return to step 1 

2. Contract the edge (i, j) into a meta node ij 

3. Connect all edges incident on i and j to the meta node ij while removing 
the edge (i, j) 

This contraction is repeated until all nodes are contracted into a specified num- 
ber of clusters or until all edges are removed. This semi-supervised contraction 
guarantees the consistency in the labeling outcome in merging the meta-nodes. 
Furthermore, both of the algorithms above can be repeated as separate random- 
ized trials on the data. The results of each trial can be considered as giving the 
probability of connectivity between individual nodes and then be combined to 
give more accurate estimation of the probability of connectivity. 
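
A single trial of the semi-supervised contraction can be sketched as below. This is an illustrative reading of the steps above, not the authors' code. The function name, the edge dictionary format and the use of `None` for the empty label are assumptions. Two details are also assumptions: when both meta nodes carry the same non-empty label they are merged and keep that label (a case the list above does not spell out), and meta nodes are tracked with a union-find structure while sampling from the original edges, which gives the same selection probabilities once edges inside a meta node are discarded.

```python
import random


def semi_supervised_contraction(n, weights, labels, k, seed=None):
    """One randomized trial.

    n       : number of nodes, indexed 0..n-1
    weights : {(i, j): S_ij} with positive similarities
    labels  : {node: label} for the labeled nodes only
    k       : target number of clusters
    Returns (cluster id per node, {cluster id: label or None}).
    """
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):                         # root of the meta node containing x
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    label = {i: labels.get(i) for i in range(n)}   # None is the empty label
    edges = dict(weights)
    clusters = n
    while clusters > k and edges:
        keys = list(edges)
        i, j = rng.choices(keys, weights=[edges[e] for e in keys])[0]
        del edges[(i, j)]                # the selected edge is always consumed
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                     # edge already inside a meta node
        li, lj = label[ri], label[rj]
        if li is not None and lj is not None and li != lj:
            continue                     # case d: conflicting labels, edge removed
        parent[rj] = ri                  # contract (i, j) into one meta node
        label[ri] = li if li is not None else lj   # cases a, b, c
        clusters -= 1
    assignment = [find(i) for i in range(n)]
    return assignment, {r: label[r] for r in set(assignment)}
```

Running the function with several different seeds and recording how often each pair of nodes ends up in the same cluster gives the kind of combined connectivity estimate described in the text.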



4 Testing on Synthetic Data 

A synthetic dataset is used for testing the semi-supervised clustering algorithm. 
Figure 1 shows the data for a two-cluster clustering problem. The two clusters 
are separated by a sinusoidal boundary. The minimum cut algorithm with 
randomized contraction is applied to the synthetic data. The result of a typical 
trial is shown in Figure 2. The cross in the right middle part of the figure is the singleton 
that is separated from the rest of the nodes. The algorithm fails to recover the 
two clusters, as the minimum cut is often given by splitting the data into a big 
cluster and an outlying point which has a large distance from its neighbors. To 




Fig. 1. Two clusters 



Fig. 2. One solution of randomized 
contraction 



test the performance of the semi-supervised clustering algorithm on the synthetic 
data, three test situations are considered: 
1. one sample from the top half and one sample from the bottom half are randomly 
chosen as training samples 

2. two samples from the top cluster and two samples from the bottom cluster 

3. ten samples from the top cluster and ten samples from the bottom cluster. 

The training samples in the above three cases are shown in Figure 3, Figure 4 and 
Figure 5 respectively. The training samples for the top cluster are shown with 
plus signs and the training samples for the lower cluster are shown with circles. 
With such a small number of training samples, one can judge from the 
figures that an inference-based or decision-surface-based classifier will not be able to 
determine accurately the sinusoidal boundary between the clusters. The solution 
for one trial of the semi-supervised clustering is shown in Figure 7. There are 
a few misclassified data points in the middle left portion of the graph. The 
average error percentage of the semi-supervised clustering is shown in the table in Figure 6. 
From the table, we can see that as the number of training samples is increased 
the average error percentage of classification drops, which is consistent with 
expectation. Furthermore, if we consider a single trial as giving the probability 
of a pixel belonging to a cluster, we can average the probabilities obtained 
from different trials and obtain the error of the combined estimation. From 
the table in Figure 6 we can see that the combined estimation is very accurate, having errors 
of less than one percent. 



Fig. 3. Train. Samples I Fig. 4. Train. Samples II Fig. 5. Train. Samples III 



5 Medical Image Database 

Tongue diagnosis is an important part of the Four Diagnosis in Traditional Chinese 
Medicine [?], where a physician visually examines the color and properties 
of both the coating of the tongue and the tongue proper [?]. Sample tongue 
images from nine patients are shown in Figure 8. The images show that there are 
large variations in the color of the tongue proper and the color of the surface coating. In 
the center image of the middle row, the tongue is covered with a yellowish coating. 
In the right image of the bottom row, the tongue is covered with a dense white-gray 
coating. The left image in the top row shows a pink tongue and the center 
image of the bottom row shows a deep red tongue. In western medicine, the 
Test situation   Average error   Error of combined est. 
1                16.58%          0.2% 
2                9.24%           0.6% 
3                2.82%           0.4% 

Fig. 6. Performance of semi-supervised clustering: (1) Average error of single trials (2) Error 
of combined estimation 

Fig. 7. One solution of semi-supervised 
randomized contraction 


visual examination of the tongue also reveals important information about the patient. 
Glossitis and geographic tongue can be diagnosed by visually examining 
the tongue. Glossitis may be caused by local bacterial or viral infections 
of the tongue or by systemic conditions such as iron deficiency anemia, 
pernicious anemia and vitamin deficiencies. 

Images drawn from sections of tongue images from different patients are used 
for comparison. Each tongue image is segmented into square blocks of 36x36, 
and the blocks which cover the tongue are selected from the image. Figure 9 
shows the partitioned blocks from a sample image. Tongue images are taken 
from 64 patients and a total of 6788 blocks are extracted. The color mean and 
variance of a color block are used as the representative features of the block. Thus, 
a color block is represented by its color means and variances as six attributes 
$(\mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3)$. The color attributes are color values in RGB color space 
or color values in CIE L*u*v* color space. 
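
The six-attribute block descriptor can be computed as in the following sketch. The function name and the representation of a block as a list of three-channel pixel tuples are assumptions of this example, as is the use of the population variance; the text does not say which variance estimator was used.

```python
from statistics import mean, pvariance


def block_features(pixels):
    """Return [mean_1, mean_2, mean_3, var_1, var_2, var_3] for one block.

    pixels is a list of 3-tuples, one per pixel, in any three-channel
    color space (for instance RGB or CIE L*u*v*).
    """
    channels = list(zip(*pixels))                 # regroup values per channel
    means = [mean(ch) for ch in channels]
    variances = [pvariance(ch) for ch in channels]
    return means + variances
```

A distance between two blocks, as needed for the similarity function, could then be taken between their feature vectors.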




Fig. 8. Sample Tongue Images 




Fig. 9. Partition of blocks in a 
sample image 



We have previously designed algorithms for color cluster analysis on the 
tongue image database [?]. In this section, we extend the analysis by incorporating 
labeled samples corresponding to typical symptoms seen in 
tongue diagnosis. Twenty color blocks are selected from different patients corresponding 
to typical symptoms. The following table shows the diagnosis of the 
20 samples selected. Each color block is indexed by three coordinates (n, bx, by), 
where n is the patient number in the database, bx is the horizontal block 
number in the image and by is the vertical block number. The semi-supervised 



Table 1. Training samples for tongue image database 

No.  TBC         Symptoms 
1    (8,8,7)     pale purple 
2    (11,5,9)    pale purple 
3    (43,11,9)   red 
4    (31,8,6)    red 
5    (27,12,4)   light pale 
6    (51,10,9)   thin white coat (pale red) 
7    (56,8,9)    thin white coat (light pale) 
8    (59,8,6)    thin white coat (pale) 
9    (42,7,5)    thin white coat (dark red) 
10   (49,7,12)   thin white coat (red) 
11   (31,5,9)    thin white coat (red) 
12   (32,4,5)    pale yellow coat (pale red) 
13   (52,3,4)    pale yellow coat (light pale) 
14   (40,3,6)    dark yellow (thick) 
15   (27,4,11)   dark yellow (light pale) 
16   (29,3,3)    dark yellow (thick) 
17   (17,4,8)    thick white (light pale) 
18   (18,13,6)   "kong" 
19   (61,5,5)    thick white 
20   (22,4,4)    light yellow (pale purple) 


clustering is then applied to the clustering of tongue image blocks in the medical 
image database. The semi-supervised clustering is repeated ten times for the 
complete image database. Each block is classified into one of the 20 sample 
classes in each trial. To find the major symptoms associated with a patient, we 
first count the matches of all the image blocks of the patient in a single 
trial. The numbers of matches of all image blocks are then accumulated over the 10 trials. 
The five classes with the most matches are taken as the major symptoms 
for the patient. To show the typical performance of the algorithm, the results 
on the first seven patients are tabulated here. Table 3 shows the results of the 
semi-supervised clustering and the results for the nearest neighbor classifier are 
shown in Table 2 for comparison. For example, for patient 1 in Table 2, the 
first symptom corresponds to class 2, which is a pale purple tongue. The second 
symptom corresponds to class 11, which is a thin white coating on a red tongue. 
Symptoms that are not consistent with the judgement of the Chinese medical 
doctor are underlined. For most of the patient cases, the differences in the results 
between the two algorithms lie in the order of the first five symptoms discovered. 
Judging from the symptoms discovered by the algorithms and the symptoms 
classified by Chinese medical doctors, the semi-supervised clustering algorithm 
is found to have higher consistency. 
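
The accumulation of matches over trials amounts to a vote count per class. A minimal sketch, assuming each trial is given as a list with the class index assigned to every block of one patient (the function name and this data layout are hypothetical):

```python
from collections import Counter


def major_symptoms(trial_assignments, top=5):
    """Return the `top` classes with the most block matches over all trials.

    trial_assignments : list of trials; each trial is a list holding the
    class index (1..20) assigned to each image block of the patient.
    """
    votes = Counter()
    for assignment in trial_assignments:
        votes.update(assignment)          # one vote per block per trial
    return [cls for cls, _ in votes.most_common(top)]
```

The order of the returned list reflects the vote totals, which is why two classifiers can agree on the set of symptoms but differ in their ranking.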

To conclude, a novel graph-based algorithm for solving the semi-supervised 
learning problem is introduced. The graph-based algorithm makes use of the re- 
cent advances in stochastic graph sampling and a modeling of the labeling con- 
sistency in semi-supervised learning. The quality of the algorithm is empirically 
evaluated on a synthetic clustering problem. The semi-supervised clustering is 
also applied to the problem of symptoms classification in medical image database 
and promising results have been obtained. 
Table 2. Major symptoms discovered 
by the nearest neighbour classifier 

Patient no.  Sym I  II   III  IV   V 
1            2      11   5    1    10 
2            10     1    6    2    5 
3            10     5    6    20   15 
4            19     8    12   11   13 
5            12     15   8    11   19 
6            19     5    1    6    9 
7            8      20   2    11   15 


Table 3. Major symptoms discovered 
by the semi-supervised clustering 

Patient no.  Sym I  II   III  IV   V 
1            11     2    5    10   1 
2            10     6    11   5    1 
3            20     10   5    6    19 
4            8      12   20   19   13 
5            12     13   8    19   20 
6            9      7    19   5    6 
7            8      20   19   12   2 
Abstract. Student modeling has been an active research area in the field of intelligent 
tutoring systems. In this paper, we propose a rough data mining approach 
to the student modeling problem. The problem is modeled as a knowledge 
discovery process in which a student’s domain knowledge (classification 
rules) is discovered and rebuilt using rough set data mining techniques. We 
design two knowledge extraction modules based on the lower approximation set 
and upper approximation set of rough set theory, respectively. To verify the 
effectiveness of the knowledge extraction modules, two similarity metrics are 
presented. A set of experiments is conducted to evaluate the capability of the 
knowledge extraction modules. Finally, based on the experimental results, some 
suggestions about a future knowledge extraction module are outlined. 



1 Introduction 

Building student models [9] to effectively represent a learner state has been an active 
research area in the field of intelligent tutoring systems. There has been much effort in 
constructing student models in the literature, including the overlay model [3], debug 
model [4], dynamic model [8], and model tracing [1]. In this paper, we propose a 
rough data mining approach based on rough set theory [5] [6] to the student-modeling 
problem. Rough set theory has found many applications in artificial intelligence and 
cognitive science, and it has been applied successfully to many practical real-world problems. 
Unlike fuzzy set theory [10], rough set theory does not need well-predefined 
membership functions to proceed successfully. Instead, it relies only on the available 
data and attributes to generate rules. In particular, rough set theory is 
a deterministic data mining methodology useful not only for large data sets but also 
for small amounts of data, for which statistical methods may not be adequate. This is 
especially important in the context of this research, in which a student’s answering 
records may not form a large record set. 

In this paper the student-modeling problem is represented as a knowledge discovery 
process in which a student’s domain knowledge (classification rules) was discovered 
and rebuilt using rough data mining techniques. Two knowledge extraction modules 
were designed based on the lower approximation set and upper approximation set of 
the rough set theory, respectively. To verify the effectiveness of the knowledge extraction 
modules, we present two similarity metrics. Based on these metrics, a set of 
simulation experiments is conducted to evaluate the capability of the knowledge extraction 
modules. Finally, following the experimental results, some suggestions about a 
future knowledge extraction module are outlined. 



2 Rough Set Theory 



A rough set [7] is represented by two sets, a lower approximation set and an upper 
approximation set. Let U be the closed universe set of objects, $Q = C \cup D$ be the attribute 
set where C is the condition attribute set and D is the decision attribute set, and A 
be any subset of Q. Then define $I(A)$ as a binary relation on U, called the "indiscernibility 
relation", that is, 

$$x \; I(A) \; y \iff a(x) = a(y) \text{ for every } a \in A, \quad x, y \in U \qquad (1)$$

where a(x) denotes the value of attribute a for object x. Note that $I(A)$ is an equivalence 
relation. Besides, let $U/I(A)$ be the partition determined by the relation $I(A)$, 
denoted as $U/A$, and let $[x]_A$ be the equivalence class containing object x. Then we say 
that (x, y) is A-indiscernible if (x, y) belongs to $I(A)$, and the equivalence classes of 
the relation $I(A)$ are called the A-elementary sets. Now, the lower approximation set and 
upper approximation set of a set $X \subseteq U$ based on A are defined as follows: 

(1) $\underline{A}(X) = \{x \in U : [x]_A \subseteq X\}$ is the set of objects whose equivalence class is included 
in X. It is called the A-lower approximation set of X. 

(2) $\overline{A}(X) = \{x \in U : [x]_A \cap X \neq \emptyset\}$ is the set of objects whose equivalence class 
overlaps with X. It is called the A-upper approximation set of X. 

For an equivalence class $[x]_A \subseteq \overline{A}(X)$, define the confidence degree of $[x]_A$ in $\overline{A}(X)$ 
as: 

$$\frac{|[x]_A \cap X|}{|[x]_A|} \qquad (2)$$
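
The definitions above translate directly into a few set operations. The following sketch is illustrative only: the function name and the representation of the universe as a dictionary from object to attribute values are assumptions of this example.

```python
from collections import defaultdict


def approximations(universe, attributes, target):
    """Lower and upper approximation of `target` with respect to `attributes`.

    universe   : {object: {attribute: value}}
    attributes : the attribute subset A
    target     : the set X of objects to approximate
    Returns (lower, upper, confidence), where confidence maps each
    equivalence class in the upper approximation to |class & X| / |class|.
    """
    classes = defaultdict(set)
    for obj, values in universe.items():
        # Objects with equal values on every attribute in A are indiscernible.
        classes[tuple(values[a] for a in attributes)].add(obj)
    lower, upper, confidence = set(), set(), {}
    for eq_class in classes.values():
        overlap = eq_class & target
        if overlap:                       # class intersects X
            upper |= eq_class
            confidence[frozenset(eq_class)] = len(overlap) / len(eq_class)
        if eq_class <= target:            # class entirely inside X
            lower |= eq_class
    return lower, upper, confidence
```

Classes in the lower approximation always have confidence degree 1, which matches the deterministic character of rules derived from lower sets.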



3 Rough Set Data Mining 

In this section our rough data mining approach for building student models is outlined. 
The system first generates testing examples for the student. The student tries to clas- 
sify the examples correctly using whatever he/she knows. All the answering records 
are kept in a database for further analysis to construct the student model (classification 
rules). Some form of interface was designed to allow the student to visually edit the 
classification rules. With these facilities, the student and the system can collaboratively 
work out the real student model, which can then be used for further, more accurate 
diagnosis. 

As to the application of rough set theory in data mining, refer to the work of [2], 
which is based only on lower approximations. Let $|U|$ be the total number of the testing 
examples, and $|C|$ be the number of condition attributes; then the complexity order 
of the Lower_Set_Rule_Extraction algorithm in [4] is 0(\C'^\Uf'). Note that every 
lower set found in this algorithm is a potential rule for output if its support degree is 
higher than some threshold value. The support degree for a specific rule is defined as 
follows: 



where W is the set of examples that match the conditions of the rule. 

Since the Lower_Set_Rule_Extraction algorithm is based on lower set approximations,
all the classification rules it generates are deterministic. To investigate what happens
when rules are also generated from upper set approximations, we devise another
algorithm, Upper_Set_Rule_Extraction, which is outlined
as follows:

Algorithm Upper_Set_Rule_Extraction 

Step 1. Use the Lower_Set_Rule_Extraction algorithm to
generate all classification rules.

Step 2. For those upper sets X of $W \in U/D$ generated in the
last iteration of the Lower_Set_Rule_Extraction algorithm,
compute the support degree of X and the confidence
degree of X. If the support degree is greater
than a user-specified threshold $\theta_s$, and the confidence
degree of X is greater than a user-specified threshold
$\theta_c$, and X hasn't been included in previously output
rules, then output the corresponding rule.

The Upper_Set_Rule_Extraction algorithm generates those deterministic rules gener- 
ated by the Lower_Set_Rule_Extraction algorithm as well as those non-deterministic 
rules with high support degree and confidence degree. Its time complexity is also 
0{\C\^\U\^). 



4 Effectiveness Metrics 



In this section we propose two similarity metrics: $M_1$ and $M_2$. Let S be the set of student
classification rules, E be the set of the extracted rules, $E_S$ be the similarity
degree of E with respect to S, and $S_E$ be the similarity degree of S with respect to E.
Besides, let $r_c$ be the set of conditions in rule $r \in S$, and $t_c$ be the set of conditions in rule
$t \in E$. Then

$$M_1 = \frac{|S \cap E|}{|E|} \times w_1 + \frac{|S \cap E|}{|S|} \times w_2 \qquad (4)$$

where $w_1 + w_2 = 1$, $0 \le w_1, w_2 \le 1$. That is, $M_1$ measures the degree of exact matches
between the rules in E and S. Since this measurement might be somewhat too strict, we
define another metric $M_2$. Let K be a rule-by-rule similarity matrix in which



K{r,t) = 





if rule r and rule t are classification rules for 



the same decision pattern; otherwise K(r, t) = 0, where $w_1 + w_2 = 1$, $0 \le w_1, w_2 \le 1$.
Then

$$M_2 = \frac{\rho_E}{|E|} \times w_1 + \frac{\rho_S}{|S|} \times w_2 \qquad (5)$$

where $\rho_E = \sum_{t \in E} \max_{r \in S} K(t, r)$ and $\rho_S = \sum_{r \in S} \max_{t \in E} K(t, r)$. That is, at the condition
level of similarity, we keep the strict-match principle, but we take a softer measurement
for the similarity between rule sets. This is inspired by the fact that partial
matches between rule sets also contribute to the collaborative construction of the student
models.



5 Simulation Experiments 



5.1 Experiment Design 

The experiment process is as follows. First, expert classification rules are obtained
through a knowledge engineering process. The expert rules are then transformed in
specific manners to simulate various kinds of students' misconceptions. The transformed
rule sets $(R_1, R_2, \ldots, R_n)$ are called Student Rule Sets. These rule sets are then
applied to a sample database (with more than 1500 samples) and the classification
results are stored as the student answering records. Now we can apply the rule extraction
modules to get the Extracted Rule Sets $(R'_1, R'_2, \ldots, R'_n)$, respectively. Finally, the
similarity between each $R_i$ set and $R'_i$ set is computed using $M_1$ and $M_2$,
respectively.

Some typical kinds of students' misconceptions are considered in the experiment sets.
For example, in the case of over-generalized/over-specialized concepts, some negative/positive
samples are misclassified as positive/negative samples. In the case of
insufficient knowledge, students are unable to classify some samples. In the case of redundant
concepts, students would be able to classify all samples correctly if the redundant
concepts were removed. Finally, the case of lack of knowledge describes a
beginner who holds little domain knowledge. To simulate these kinds of misconceptions,
we design six kinds of rule transformations. For each kind of misconception
mentioned above, three sub-experiments are conducted and analyzed using the
$M_1$ and $M_2$ similarity metrics for the lower-set extraction algorithm and the two upper-set
extraction algorithms with $\theta_c = 0.5$ and $\theta_c = 0.8$, respectively. All three algorithms
are executed with $\theta_s = 0.5$.



5.2 Experimental Results 



The Delete-Some-Conditional-Attributes Experiment. This experiment is planned
to simulate misconceptions of over-generalized concepts. It includes sub-experiments
on sets of student rules, each of which was obtained by deleting one conditional attribute
at a specific location from each expert rule. On average the upper-set rule extraction
algorithms perform better than the lower-set one.

The Delete-Some-Rules Experiment. This experiment is planned to simulate 
misconceptions of insufficient knowledge by deleting a portion of the expert rules. It 
includes six sub-experiments for which the deleted rules are, respectively, (1) the 
even-number indexed rules, (2) the rules whose indices are multiples of 3, (3) the rules 
whose indices are multiples of 4, (4) the first one-third of the rules, (5) the second one- 
third of the rules and (6) the last one-third of the rules. The results show that the upper 
algorithms outperform the lower one with M2 > 0.53. 

The Change-Conditional-Attribute-Values Experiment. This experiment is
planned to simulate misconceptions that mix over-generalized and over-specialized
concepts. It includes seven sub-experiments. The first five experiments change the
conditional attribute values, while the last two change the decision attribute values. In
these cases the upper extraction algorithms perform much better and more stably than
the lower one.

The Add-Conditional-Attributes Experiment. This experiment is planned to
simulate misconceptions of over-specialized concepts. It includes five sub-experiments,
each of which adds a specific conditional attribute and its values to the expert
rules. The M2 values of the three algorithms lie between 0.62 and 0.79, which shows that
the reconstruction capability of the rule extraction algorithms is acceptable.

The Add-Some-Rules Experiment. This experiment is planned to simulate 
misconceptions of redundant concepts. It includes five sub-experiments, each of which 
adds one specific rule into the expert rules. Again the upper rule extraction algorithms 
outperform the lower one and are more stable. 

The Delete-Large-Amount-of-Rules Experiment. This experiment is planned to
simulate misconceptions of a beginner with very few concepts. It includes six
sub-experiments, each of which randomly deletes five-sixths of the expert rules. The
result is quite satisfying.

In summary, the lower algorithm performs better in the $M_1$ metric, while the upper
algorithms significantly outperform the lower one in the $M_2$ metric. The confidence
threshold $\theta_c$ did not make a significant difference to the metric for the upper algorithms.
Nevertheless, our results show that the upper algorithm with $\theta_c = 0.5$ performs a
little better than the one with $\theta_c = 0.8$.



6 Concluding Remarks

In this paper, we investigate the feasibility of applying rough data mining methods to
the automatic construction of student models. The simulation results show that it is hard to
extract "just-the-same" rules using current rough data mining methods. Nevertheless,
the rough data mining methods, especially the upper algorithms, perform well
at extracting "almost-the-same" rules. Besides, when inconsistency exists in the
student rules, the upper algorithms can deal with the inconsistency problem very well.
Finally, to improve the rule extraction effectiveness, the extraction algorithms should
use only those attributes that students actually used to give the answers. This will
contribute much, especially to the extraction of inconsistent student rules.

Abstract. In this paper we present a novel approach to concept
approximation in concept lattices. Using an idea similar to rough set theory
together with unique properties of the concept lattice, upper and lower approximations of
any object or attribute set can be found by exploiting meet-(union-)irreducible
elements in the concept lattice, so the approximations can be performed on the fly.
We show that our approach is more natural and effective than the existing
approaches.



1 Introduction 

Concept lattice, also called Galois lattice, was first proposed by Wille [3]. A node of
a concept lattice is an objects/attributes pair, called a formal concept, consisting of two
parts: the extension (objects the concept covers) and the intension (attributes describing
the concept). A concept lattice gives a vivid and concise account of the relations
(generalization/specialization) among those concepts through a Hasse diagram.
Concept lattices are useful for data mining [4,5,9], information retrieval [7], software
engineering [8], etc.

Not every pair of objects and attributes defines a formal concept. Only those
maximally extended ones are included in the concept lattice, i.e., the attributes in the intension
are the maximal common attributes of the objects in the extension and vice versa. This raises
the problem of how best to approximate a set of objects or attributes if there
is no exact match in the concept lattice. Furthermore, is it possible to approximate
them without knowing the whole concept lattice? After all, generating the whole
lattice is an expensive operation.

This kind of approximation may be useful in many situations. For example, 
assuming that we have a concept lattice describing a set of documents and a set of 
keywords, when given a query (a set of keywords), we could find out the best 
approximate result if an exact match failed. 

Rough set theory [6] efficiently approximates a given concept by using a pair of
concepts, namely the upper and lower approximations. In this paper, using a similar
idea, we propose a concept approximation method in concept lattices. However, there are
two differences. First, our approach is not based on equivalence classes.
Second, our approach makes use of the properties of lattices, that is, the existence of
meet-irreducible elements, which greatly simplifies the computation of the
approximation.

There are some similar works [1,2]. We argue that our method is more natural
and effective than the existing approaches, and by exploiting meet-irreducible elements in the
concept lattice, we can generate concept approximations on the fly.



D. Cheung, G.J. Williams, and Q. Li (Eds.): PAKDD 2001, LNAI 2035, pp. 167-173, 2001. 
© Springer- Verlag Berlin Heidelberg 2001 




168 K. Hu et al. 



The rest of the paper is organized as follows. Section 2 recalls necessary notions 
used in this paper. Section 3 introduces related work. Our approach is presented in 
section 4 and an illustrating example is given in section 5. Section 6 concludes the 
paper. 



2 Basic Notions 

In this section we recall necessary basic notions of concept lattice and rough set 
briefly. The detail description can be found in [3,6]. 

First, we begin with some notions from concept lattice. 

Given a context (O, D, R) describing a set O of objects, a set D of
descriptors and a binary relation R, there is a unique corresponding lattice structure L,
which is known as the concept lattice. Each node in lattice L is a pair, denoted (X, Y),
where $X \in P(O)$ is called the extension of the concept and $Y \in P(D)$ is called the intension of the
concept. Each pair must be complete with respect to R, i.e.:

(1) $X = \alpha(Y) = \{x \in O \mid \forall y \in Y,\ xRy\}$
(2) $Y = \beta(X) = \{y \in D \mid \forall x \in X,\ xRy\}$

A partial order relation can be built on all concept lattice nodes. Given $H_1 = (X_1, Y_1)$
and $H_2 = (X_2, Y_2)$, let $H_1 < H_2 \Leftrightarrow X_1 \subset X_2$; the precedent order means $H_1$ is a direct
parent of $H_2$. The Hasse diagram of the lattice can be generated using the partial order
relation: if $H_1 < H_2$ and there is no other node $H_3$ such that $H_1 < H_3 < H_2$, there is an edge
from $H_1$ to $H_2$.

In rough set theory, an information system plays a role similar to that of the context in a concept
lattice. An information system is an ordered pair S = (O, D), where O is a non-empty,
finite set called the universe and D is a non-empty, finite set of attributes. The elements
of the universe are called objects.

Let S = (O, D) be an information system. Every subset $B \subseteq D$ defines an
equivalence relation IND(B), called an indiscernibility relation, defined as
$IND(B) = \{(x, y) \in O \times O : a(x) = a(y) \text{ for every } a \in B\}$.

Given an information system S = (O, D), let $X \subseteq O$ be a set of objects and $B \subseteq D$ a
selected set of attributes. The lower approximation of X with respect to B is
$B_*(X) = \{x \in O : [x]_B \subseteq X\}$. The upper approximation of X with respect to B is
$B^*(X) = \{x \in O : [x]_B \cap X \neq \emptyset\}$, where $[x]_B = \{y \in O : (x, y) \in IND(B)\}$.

The upper approximation consists of all objects possibly belonging to X and the lower
approximation consists of all objects definitely belonging to X. Obviously we have
$B_*(X) \subseteq X \subseteq B^*(X)$.

Concept lattice includes all concepts in context, and assembles them in a visual 
concept hierarchy while rough set provides a powerful concept approximation 
mechanism. 



3 Related Work 

To the best of the authors' knowledge, there are two existing works on concept
approximation, that is, Kent's work [1] and Saquer's work [2].

Kent used an equivalence relation E on the set of objects O provided by an
expert. A pair (O, E), where E is an equivalence relation on O, is called an
approximation space. An E-definable formal context of O-objects and D-attributes is
a formal context (O, D, R) whose elementary extents $\{g \in O \mid gRm\}$, $m \in D$, are
E-definable subsets of O-objects. Two new concept lattices, the lower and upper
E-approximations of R with respect to (O, E), are then defined. Undefined concepts are
approximated by finding the closest elements in the two new approximating lattices.

Kent's approach is not natural because the upper and lower approximations of the
concept lattice have to be found first, and the resulting approximations depend on the
equivalence relation chosen.

Saquer and Deogun defined a natural equivalence relation on O in which $g_1$ and $g_2$ are
equivalent iff $g_1R = g_2R$, where $gR = \{m \in D \mid gRm\}$, $g \in O$. If a set $A \subseteq O$ is an extent of the concept lattice, it
is called feasible; if a set $A \subseteq O$ is a union of feasible sets, it is called definable. By
declaring each equivalence class of this relation to be a feasible set, a non-definable set $A \subseteq O$ is
approximated by $A_* = \{g \in O \mid [g] \subseteq A\}$ and $A^* = \{g \in O \mid [g] \cap A \neq \emptyset\}$.

Saquer's idea is to approximate a new concept using the union of existing
concepts in the concept lattice. That is similar to the idea of rough sets, but things in
a concept lattice are different. All possible concepts on O are
included in the concept lattice, whereas a definable set may not be in the concept lattice. That is,
a definable set may fail to be a concept, in which case it cannot be expressed by the concept
lattice and the approximation becomes meaningless. Moreover, it is
not easy to know whether a set is definable according to the theory given.

What is the best approximation of a given concept? We argue that the best
approximation of a given concept is given by two existing concepts in the concept lattice, which
approximate the concept from both directions. This is because the existing concepts in the
concept lattice are all the possible concepts we “know” from the original context.



4 Our Approach 

First we introduce the following definition and theorem from lattice theory.

Definition 4.1 An element a is meet-irreducible in a lattice L if for any $b, c \in L$,
$a = b \wedge c$ implies a = b or a = c; dually, an element a is union-irreducible in a lattice L if
for any $b, c \in L$, $a = b \vee c$ implies a = b or a = c.

Theorem 4.1 Every element is the meet (union) of meet-irreducible (union-irreducible)
elements.

Given a context (O, D, R) and the corresponding concept lattice L, let $gR = \{m \in D \mid gRm\}$
for $g \in O$.

Definition 4.2 Given a context (O, D, R), a binary relation J on O is defined as
$g_1 \, J \, g_2$ iff $g_1R \subseteq g_2R$, where $g_1, g_2 \in O$.

Clearly, J is reflexive, anti-symmetric and transitive. Thus, J is a partial order relation on O.
We denote the partial class of g as [g], namely, $[g] = \{g' \in O : g \, J \, g'\}$.

Theorem 4.2 Every pair ([g], β([g])) is a union-irreducible element of L.

Proof. First we prove that ([g], β([g])) is an element of L. We need only show that [g] is
maximally extended. By the definition of [g], all objects that share the maximal common
attributes with g are in [g], so [g] must be maximally extended. Therefore the pair must
be in L.

Assume ([g], β([g])) is not a union-irreducible element of L. Then there are two pairs (A,
β(A)), (B, β(B)) ∈ L such that [g] = A ∨ B with [g] ≠ A and [g] ≠ B. So [g] ⊃ A, [g] ⊃ B, and g ∈ A
or g ∈ B. Without loss of generality, assume g ∈ A. From the duality of the concept lattice, we have β([g]) ⊂ β(A).
That is, g must have a larger attribute set in the pair (A, β(A)), but this is a
contradiction, because from the definition of [g], we know that the attribute set
possessed by g should be the smallest. Therefore ([g], β([g])) must be a union-irreducible
element of L.

Let UI be the set of all the union-irreducible elements in L, and P be the set of all pairs ([g],
β([g])).

Theorem 4.3 UI = P

Proof. From Theorem 4.2, we know every element in P is also an element in UI. We
need only prove that every union-irreducible element is an element of P.

Assume (A, β(A)) is one of the union-irreducible elements. Either there is an object g ∈ A
such that gR = β(A), so that ([g], β(A)) is in P; or for every object g ∈ A we have
gR ⊃ β(A), so that (A, β(A)) can be decomposed into $\bigvee ([g_i], \beta([g_i]))$, $g_i \in A$. The latter contradicts
the assumption that (A, β(A)) is a union-irreducible element. Therefore, there must be an object g ∈ A
such that gR = β(A), i.e., the union-irreducible elements must be elements of P.

From Theorems 4.3 and 4.1, we know all elements in the concept lattice can be expressed
as unions of elements in P.

Similarly, let $Rm = \{g \in O \mid gRm\}$; we have:

Definition 4.3 Given a context (O, D, R), a binary relation J' on D is defined as
$m_1 \, J' \, m_2$ iff $Rm_1 \subseteq Rm_2$, where $m_1, m_2 \in D$.
We denote the partial class of m as [m], namely, $[m] = \{m' \in D : m \, J' \, m'\}$.

Theorem 4.4 Every pair (α([m]), [m]) is a meet-irreducible element of L.

Let MI be the set of all the meet-irreducible elements, and Q be the set of all pairs (α([m]), [m]).

Theorem 4.5 MI = Q

The proofs are similar to those of Theorems 4.2 and 4.3.

From Theorems 4.5 and 4.1, we know all elements in the concept lattice can also be
expressed as meets of elements in Q.

Definition 4.4 Given $X \subseteq O$, the lower and upper approximations of X with respect to
L are

$$X_* = \bigcup \{X' \subseteq X \mid (X', \beta(X')) \in UI\}$$

and

$$X^* = \bigcap \{X' \supseteq X \mid (X', \beta(X')) \in MI\}$$

Theorem 4.6 Given $X, X' \subseteq O$, the lower approximation satisfies the following
properties:

(1) $O_* = O$

(2) $X_* \subseteq X$

(3) $X \subseteq X' \Rightarrow X_* \subseteq X'_*$

(4) $(X \cup X')_* = X_* \cup X'_*$

(5) $X_* = X_{**}$

and the upper approximation satisfies the following properties:

(6) $\emptyset^* = \emptyset$

(7) $X \subseteq X^*$

(8) $X \subseteq X' \Rightarrow X^* \subseteq X'^*$

(9) $(X \cap X')^* = X^* \cap X'^*$

(10) $X^* = X^{**}$

Theorem 4.7 Given $X \subseteq O$, the best lower and upper approximations are given by
$(X_*, \beta(X_*))$ and $(X^*, \beta(X^*))$, respectively.

Proof. We prove that $X_*$ is the best lower approximation; the proof for the upper
approximation is similar. Assume there is another concept (A, β(A)) such that
$X_* \subseteq A \subseteq X$. Then A can be expressed as the union of $X_*$ and some union-irreducible
elements, that is, $A = X_* \vee (\bigvee_j ui_j) \subseteq X$, where $ui_j \in UI$. Thus $ui_j \subseteq X$. According to the
definition of $X_*$, $ui_j \subseteq X_*$. Then we have $A = X_*$. Therefore $X_*$ is the best approximation
of X.

Dually, we have the following definition and theorem for attribute set approximation.

Definition 4.5 Given $Y \subseteq D$, the lower and upper approximations of Y with respect to
L are

$$Y_* = \bigcap \{Y' \supseteq Y \mid (\alpha(Y'), Y') \in MI\}$$

and

$$Y^* = \bigcup \{Y' \subseteq Y \mid (\alpha(Y'), Y') \in UI\}$$

Theorem 4.8 Given $Y \subseteq D$, the best lower and upper approximations are given by
$(\alpha(Y_*), Y_*)$ and $(\alpha(Y^*), Y^*)$, respectively.

Note that we do not find upper and lower approximations for an arbitrary pair (X, Y), $X \subseteq O$, $Y \subseteq D$,
because this kind of combination may be meaningless in the context, i.e., it does not
represent any meaningful concept. For example, finding an approximation for (O, D) is
meaningless and impossible.

5 An Illustrating Example 

In this section we illustrate our idea using an example from [3]. Figure 1 is a simple
context, and Figure 2 is the corresponding concept lattice. We have O = {1, 2, 3, 4, 5, 6,
7, 8}, D = {a, b, c, d, e, f, g, h, i}, and R describing which objects in O possess which
attributes in D.

Fig. 1 A context excerpted from [3], p. 18. a = needs water; b = lives in water; c = lives on land;
d = needs chlorophyll; e = two seed leaves; f = one seed leaf; g = can move around; h = has limbs;
i = suckles its offspring.

From Definitions 4.2 and 4.3, the partial classes of objects and attributes can be computed as
follows:

[1] = {1,2,3}, corresponding union-irreducible element is ({1,2,3}, {a,b,g})
[2] = {2,3}, corresponding union-irreducible element is ({2,3}, {a,b,g,h})
[3] = {3}, corresponding union-irreducible element is ({3}, {a,b,c,g,h})
[4] = {4}, corresponding union-irreducible element is ({4}, {a,c,g,h,i})
[5] = {5,6}, corresponding union-irreducible element is ({5,6}, {a,b,d,f})
[6] = {6}, corresponding union-irreducible element is ({6}, {a,b,c,d,f})
[7] = {7}, corresponding union-irreducible element is ({7}, {a,c,d,e})
[8] = {6,8}, corresponding union-irreducible element is ({6,8}, {a,c,d,f})

[a] = {a}, corresponding meet-irreducible element is ({1,2,3,4,5,6,7,8}, {a})
[b] = {a,b}, corresponding meet-irreducible element is ({1,2,3,5,6}, {a,b})
[c] = {a,c}, corresponding meet-irreducible element is ({3,4,6,7,8}, {a,c})
[d] = {a,d}, corresponding meet-irreducible element is ({5,6,7,8}, {a,d})
[e] = {a,c,d,e}, corresponding meet-irreducible element is ({7}, {a,c,d,e})
[f] = {a,d,f}, corresponding meet-irreducible element is ({5,6,8}, {a,d,f})
[g] = {a,g}, corresponding meet-irreducible element is ({1,2,3,4}, {a,g})
[h] = {a,g,h}, corresponding meet-irreducible element is ({2,3,4}, {a,g,h})
[i] = {a,c,g,h,i}, corresponding meet-irreducible element is ({4}, {a,c,g,h,i})

We can verify in Figure 2 that they are union-irreducible and meet-irreducible
elements, respectively.

Given a set X = {2, 4}, its lower and upper approximations can be computed as:

$X_* = \bigcup \{X' \subseteq X \mid (X', \beta(X')) \in UI\} = \bigcup \{\{4\}\} = \{4\}$

$X^* = \bigcap \{X' \supseteq X \mid (X', \beta(X')) \in MI\} = \bigcap \{\{1,2,3,4,5,6,7,8\}, \{1,2,3,4\}, \{2,3,4\}\} = \{2,3,4\}$.

We can verify that $X_*$ and $X^*$ are the best approximations in the concept lattice.

Using the method in [2], we get $X_* = \{2, 4\}$ and $X^* = \{2, 4\}$, but {2, 4} is not a concept
expressible by the concept lattice.
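
The computation in this example can be retraced with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the object rows of the context are read off the intents of the union-irreducible elements listed above (the table of Fig. 1 is assumed to agree with them), and all function and variable names are hypothetical. The extent of the meet-irreducible element built from [m] is taken to be Rm, which follows from the definition of [m].

```python
# Object rows of the example context, read off the listed intents
# (assumption: they match the table of Fig. 1).
CONTEXT = {
    1: set("abg"), 2: set("abgh"), 3: set("abcgh"), 4: set("acghi"),
    5: set("abdf"), 6: set("abcdf"), 7: set("acde"), 8: set("acdf"),
}
OBJECTS = set(CONTEXT)
ATTRIBUTES = set().union(*CONTEXT.values())

def object_class(g):
    """[g]: the objects whose attribute set contains the attribute set of g."""
    return frozenset(h for h in OBJECTS if CONTEXT[g] <= CONTEXT[h])

def attribute_extent(m):
    """Rm: the objects possessing attribute m."""
    return frozenset(g for g in OBJECTS if m in CONTEXT[g])

# Extents of the union-irreducible and meet-irreducible elements.
UI_EXTENTS = {object_class(g) for g in OBJECTS}
MI_EXTENTS = {attribute_extent(m) for m in ATTRIBUTES}

def lower_approx(X):
    """Union of the union-irreducible extents contained in X."""
    X = set(X)
    return set().union(*(e for e in UI_EXTENTS if e <= X))

def upper_approx(X):
    """Intersection of the meet-irreducible extents containing X.

    If no such extent exists, the whole object set is returned.
    """
    X = set(X)
    result = set(OBJECTS)
    for e in MI_EXTENTS:
        if X <= e:
            result &= e
    return result

print(lower_approx({2, 4}), upper_approx({2, 4}))
```

Only the irreducible elements are built here, one per object and one per attribute, so the full lattice is never generated; this is the point the approach relies on.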



6 Conclusion 

We propose a novel approach to concept approximation in concept lattices using
the idea of rough set theory. However, our method is based on partial order relations and
the properties of lattices. By using the partial order relations on objects and attributes, we are
able to find the meet-(union-)irreducible elements of the concept lattice, and then
approximate a set of objects or attributes using these elements. The computation of the
approximation is straightforward and does not require generating the whole concept
lattice, which avoids the computing and memory burden brought by lattice generation.
Compared with the existing approaches, our method is more natural and effective.
Acknowledgments. This research is supported by Natural Science Foundation of 

Abstract. In relational theory, attribute domains are classical sets;
no interactions among attribute values are modeled. So the concept
hierarchies used in data mining, which are additional semantics, have
to be input by users. In this paper, a “real world” data model, that is, a relational
model with additional semantics specified by binary relational structures
(adopted from first order logic), is explored; in such a model, concept
hierarchies/networks can be generated automatically. In fact, there
are two families of concepts. One family forms a traditional hierarchy.
The other forms a hierarchy syntactically, but semantically the hierarchy
is a network; this is due to the fact that distinct concepts may be
semantically related. A simple example is illustrated.

Keywords: attribute oriented generalization, binary relation, concept 
hierarchies/networks, data mining, partition, granulation, neighborhood 
systems 



1 Introduction 

In PI we stated to the effect that concept hierarchy are supplied by human, 
and “some automations of building such semantic relations are needed for large
scaled applications. We will report our exploration in future papers.” This paper
is to honor the statement.

In traditional attribute-oriented generalization (AOG) approaches, concept
hierarchies are specified as trees, with the attribute values (known as base concepts)
at the leaves, and higher level concepts as the interior nodes. These hierarchies
embody certain implicit assumptions on the data structures of the active
attribute domains. The primary assumption is that there is a nested sequence
of equivalence relations among these leaf concepts. The first level parent nodes
represent the equivalence classes of the innermost equivalence relation. Such an
assumption restricts hierarchies to tree structures, precluding other types of
relationships among concepts. In [5], the notion has been successfully extended
to binary relations; in other words, there is instead a nested sequence of binary
relations among the leaf concepts. We call it the concept hierarchy/network,
since its structure is more than a tree.

In this paper, “real world” databases, that is, relational databases with additional
semantics, are investigated. Based on the additional semantics, concept hierarchies/networks
can be automatically generated. In fact, somewhat surprisingly,
we find there are two concept “hierarchies”: one is the same as the traditional one;
the other is syntactically a hierarchy, but semantically forms a network.



2 Relational and “Real World” Model 

The intent of a database is to model the universe of real world entities by tuples
of elementary concepts (attribute values). For simplicity, relational theory
tacitly assumes that everything, such as the universe and the attribute domains, is a
classical Cantor set. In other words, the interactions among elements in real
world sets are “forgotten” in relational modeling. However, in practical data
processing, some additional semantics derived from “real world” attributes are
often processed. For example, in numerical attributes, the order of numbers is
often used in SQL statements. In multimedia databases, geographical relationships
such as “near” and “in the same area” are often used in data processing
by human operators. Therefore these “real world” semantics implicitly exist in
the stored data. To capture such additional semantics in data mining, we need
a “real world” structure to replace Cantor set theory.

What would be the “correct” mathematical structure for the “real world”? This
is a question that may have many ad hoc answers. We decide to consult
history. In the model theory of first order logic, relational structures have been
used to model the “real world.” So we will assume the universe of discourse has
relational structures. As a first step, we will confine ourselves to the simplest
kind of relational structure namely, binary relations. So we will assume that all 
attribute domains are embedded with binary relations; see Section o 



2.1 “Real World” Structure - Additional Semantics

To illustrate the fundamental idea, let us modify a relation (Table 1) from a
popular text [2] by giving some additional semantics using binary relations:

1. an order relation, <, is defined on Dom(STATUS) naturally,

2. “near” semantics in the attribute domain Dom(CITY) is defined algebraically
by Table 2 or geometrically by Table 3.



2.2 “Real World” Dependencies - Strong and Weak Dependencies 

Note that Table 1 has a binary relation (the “near” relation of Table 2) on the CITY attribute, an order relation
< on STATUS, and a discrete (identity) relation on each of the remaining attributes.
If we had forgotten the semantic relations in the CITY and STATUS attributes,

Table 1. The Supplier Table

SNUM  SNAME     STATUS  CITY
S1    Smith     TWENTY  B
S2    Jones     TEN     C
S3    Blake     TEN     C
S4    Clark     TWENTY  B
S5    Adams     THIRTY  D
S6    Peterson  FORTY   E
S7    Ewing     EIGHTY  F
S8    Johnson   EIGHTY  F
S9    Pike      FORTY   E
S10   Meyers    NINETY  G

Table 2. The “near” Binary Relation on CITY

CITY  CITY
B     B
B     C
C     C
C     B
C     D
D     D
D     E
E     E
E     D
F     F
F     G
G     G

Table 3. Binary Granulation and Neighborhood System

CITY    Elementary neighborhood (granule)    Elementary concept (granule name)
B  →    {C, B}                               = J
C  →    {C, B, D}                            = K
D  →    {D, E}                               = L
E  →    {D, E}                               = L
F  →    {F, G}                               = M
G  →    {G}                                  = N

then one would think that there were extensional functional dependencies in both
directions, namely, CITY → STATUS and STATUS → CITY.
However, for example, B and C are symmetrically “near”-related (B is near C and
C is near B), but their images, TWENTY and TEN, are not symmetrically related
(TWENTY ≮ TEN and TEN < TWENTY). So only the map
STATUS → CITY respects the semantic relations. In such a semantically
richer situation, a map should not be treated as a functional dependency unless
it respects the semantics; see Section 3. The only functional dependency, called a
weak dependency, in Table 1 is STATUS → CITY.

3 Binary Relations and The Induced Equivalence 
Relations 

We will recall some theory of binary relations from . 

3.1 Binary Relations and Binary Neighborhood System Spaces 

Binary Relation. Let $B \subseteq V \times V$ be a binary relation on V. For each object
$p \in V$, we associate a subset

$$NEIGH_B(p) = \{u \mid p \, B \, u\},$$

called the elementary (binary) B-neighborhood $NEIGH_B(p)$ (Note: elementary
neighborhood is used as a generalization of elementary set (= equivalence class)
in rough set theory, and binary neighborhood is used as a reminder of the binary
relation). Note that each point p has only one elementary neighborhood (intuitively
it can be viewed as the nearest neighborhood of p). This association defines a
map, called the binary B-granulation:

$$B : V \to 2^V : p \mapsto NEIGH_B(p).$$

Next we gather all the B-neighborhoods together, i.e., we set

$$NEIGH_B(V) = \{NEIGH_B(p) \mid \forall p \in V\}$$

and call the collection a binary B-neighborhood system (BNS).

Conversely, suppose $NEIGH_B(V) = \{NEIGH_B(p)\}$ is given; we can get
the binary relation back by defining

$$B = \{(p, u) \mid u \in NEIGH_B(p)\}.$$

We summarize the discussion in a

Proposition. There is a one-to-one correspondence between binary neighborhood
systems, binary granulations and binary relations.

So from now on, we will treat them as synonyms and use them interchangeably;
we will use the same notation B for all of them. In particular, we will write

$$NEIGH_B(V) = B(V) \quad \text{and} \quad NEIGH_B(p) = B(p)$$

If the binary relation B is an equivalence relation E, then the binary
granulation is a partition. An elementary set is the elementary E-neighborhood of
its members (in rough set theory an equivalence class is called an elementary
set). We will use "elementary granule" to denote both elementary neighborhood
and elementary set. For B = VC, it is illustrated in Table 3.
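
The correspondence between a binary relation and its neighborhood system can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the relation VC is read off the elementary neighborhoods listed in Table 4.

```python
def neighborhoods(relation, universe):
    """Return {p: NEIGH_B(p)} with NEIGH_B(p) = {u | p B u}."""
    nbhd = {p: set() for p in universe}
    for p, u in relation:
        nbhd[p].add(u)
    return {p: frozenset(us) for p, us in nbhd.items()}


def relation_from(nbhd):
    """Recover B = {(p, u) | u in NEIGH_B(p)} from a neighborhood system."""
    return {(p, u) for p, us in nbhd.items() for u in us}


# VC on the cities B..G, taken from the neighborhoods in Table 4 (assumption).
V = list("BCDEFG")
VC = {
    ("B", "B"), ("B", "C"),
    ("C", "B"), ("C", "C"), ("C", "D"),
    ("D", "D"), ("D", "E"),
    ("E", "D"), ("E", "E"),
    ("F", "F"), ("F", "G"),
    ("G", "G"),
}

nbhd = neighborhoods(VC, V)
print(sorted(nbhd["C"]))          # the elementary neighborhood K of C
print(relation_from(nbhd) == VC)  # the round trip gives the relation back
```

Going from the relation to the neighborhoods and back returns the same set of pairs, which is the content of the Proposition above.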

Binary Neighborhood System Space. The pair (V, B) has been called a
binary neighborhood system space (BNS-space) [7], |0|. A BNS-space is a
pre-topological space; it is a variant of a Frechet (V)-space ^3|. In the case that B is
an equivalence relation E, (V, E) is a clopen topological space and is the focus
of rough set theory [S|. We will collect a few simple properties of BNS-spaces. Let
B' be another binary relation.

Definition 

1. A subset $X$ is called a definable B-neighborhood if $X$ is a union of
elementary (binary) neighborhoods of $B$.

2. The set of all definable B-neighborhoods at $p$ is denoted by $BS(p)$; the set
of all definable B-neighborhoods in $V$ is denoted by $BS(V)$.

3. Let $S$ be a subset of $V$. $NEIGH(S) = \bigcup_{p \in S} B(p)$ is called the
elementary B-neighborhood of $S$; note that it is a definable neighborhood.

4. $B'$ weakly (or continuously) depends on $B$ iff every elementary
B-neighborhood at $p$ is contained in the elementary B'-neighborhood at $p$;
in other words, the identity map is continuous.

5. $B'$ strongly (or definably) depends on $B$, denoted by $B \to B'$, iff every
elementary B'-neighborhood is a definable B-neighborhood; we say $B$ is
definably finer than $B'$, or $B'$ is definably coarser than $B$.

Strong dependence is an elaborate extension of the refinement of equivalence
relations. The obvious extension, weak dependence, has no desirable properties of
"functional" dependency (or knowledge dependency).
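
The two notions of dependence can be checked mechanically. The sketch below is a hedged illustration: the helper names are hypothetical, and the two neighborhood systems are the VC and VC o VC granulations as listed in Tables 4 and 5.

```python
def is_definable(X, nbhd):
    """X is definable if it is a union of elementary neighborhoods of nbhd."""
    X = frozenset(X)
    covered = frozenset().union(*(N for N in nbhd.values() if N <= X))
    return covered == X


def strongly_depends(coarse, fine):
    """coarse strongly depends on fine: each coarse neighborhood is definable in fine."""
    return all(is_definable(N, fine) for N in coarse.values())


def weakly_depends(coarse, fine):
    """coarse weakly depends on fine: fine(p) is contained in coarse(p) for every p."""
    return all(fine[p] <= coarse[p] for p in fine)


def system(spec):
    """Build {point: frozenset(neighborhood)} from a compact string spec."""
    return {p: frozenset(s) for p, s in spec.items()}


# Elementary neighborhoods as listed in Tables 4 and 5 (assumption).
VC = system({"B": "BC", "C": "BCD", "D": "DE", "E": "DE", "F": "FG", "G": "G"})
VC2 = system({"B": "BCD", "C": "BCDE", "D": "DE", "E": "DE", "F": "FG", "G": "G"})

print(strongly_depends(VC2, VC))  # VC o VC is definably coarser than VC
print(strongly_depends(VC, VC2))  # the converse fails: J = {B, C} is not definable
print(weakly_depends(VC2, VC))
```

Testing definability by taking the union of all elementary neighborhoods contained in X avoids enumerating every possible union.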

3.2 The Induced Equivalence Relations 

Classification is often an important kind of knowledge for humans. Its
mathematical counterpart is a partition, or equivalently an equivalence relation.
Though a binary neighborhood system (binary relation) is in general not a
partition (equivalence relation), we will show that it does induce a partition
(equivalence relation).

Given a map $f : X \to Y : x \mapsto y$, its family of complete inverse images
$f^{-1}(y)$ forms a partition; e.g., see jO]. Here we take the map to be the
binary granulation,

$B : V \to 2^V : p \mapsto B(p)$.

The inverse image $B^{-1}(B(p))$ is called the center of $B(p)$; it consists of
all points whose elementary neighborhood equals $B(p)$. The family of all
centers forms a partition, called the induced partition of a binary relation:

$E_B = \{B^{-1}(B(p)) \mid B(p) \text{ is an elementary neighborhood in } B(V)\}$.

By abuse of language, we may also call each member of the center a center.
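
As a small illustration of the induced partition, the following sketch groups points by their elementary neighborhood; each group is a center. The function name is hypothetical and the neighborhoods are those of VC as listed in Table 4.

```python
def induced_partition(nbhd):
    """Map each elementary neighborhood B(p) to its center B^{-1}(B(p))."""
    centers = {}
    for p, N in nbhd.items():
        centers.setdefault(N, set()).add(p)
    return centers


# VC neighborhoods as listed in Table 4 (assumption).
VC = {p: frozenset(s) for p, s in
      {"B": "BC", "C": "BCD", "D": "DE", "E": "DE", "F": "FG", "G": "G"}.items()}

parts = induced_partition(VC)
for N, center in parts.items():
    print(sorted(center), "is the center of", sorted(N))
```

Because every point has exactly one elementary neighborhood, the centers are pairwise disjoint and cover V, so they form a partition even though the neighborhoods themselves overlap.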

1. The binary granulations or binary neighborhood systems (Section 4.2)

a) $VC : Dom(A) \to 2^{Dom(A)}$

b) $VC \circ VC : Dom(A) \to 2^{Dom(A)}$

c) $VC \circ VC \circ VC : Dom(A) \to 2^{Dom(A)}$

induce 3 partitions; see Tables 4, 5, and 6.

2. Each elementary set (= equivalence class) of the 3 induced partitions is the
center of the corresponding elementary neighborhood.

3. Each distinct center of an induced partition is labeled by a distinct
elementary concept. Since all centers of an induced partition are mutually
disjoint, they are independent of each other (within an attribute domain).

4. Each distinct elementary neighborhood of a binary granulation is labeled
by a distinct elementary concept. However, elementary neighborhoods may
overlap, so these elementary concepts are only syntactically distinct, and
semantically they may be related. For example, K and J are distinct, but
semantically related; however, the labels of the centers, Kp and Jp, are
distinct and semantically independent; see Tables 4, 5, and 6.

5. If the VC-binary relation is an equivalence relation, then $VC = E_{VC}$ and
hence these elementary concepts are both syntactically and semantically
independent.

6. In databases, elementary concepts are referred to as attribute values.



Table 4. VC-Binary Granulation and Induced Partition

Center (elementary set of E_VC)        Elementary neighborhood of VC
{B} = Jp                          ->   {C,B} = J
{C} = Kp                          ->   {C,B,D} = K
{D,E} = Lp                        ->   {D,E} = L
{F} = Mp                          ->   {F,G} = M
{G} = Np                          ->   {G} = N



4 Generating Concept Hierarchies/Networks 

4.1 Concept Hierarchies/Networks 

In jSj, a concept is recursively defined to be:

1. a label (base concept) for each distinct attribute value, for example, the first
column of Table 7;

2. a label (parent concept) for a set of concepts (referred to as sibling
concepts), which are not mutually recursive; for example, see the second or
third columns of Table 7.

Table 5. VC o VC-Binary Granulation and Induced Partition

Center (elementary set of E_VCoVC)     Elementary neighborhood of VC o VC
{B} = Kp                          ->   {C,B,D} = K
{C} = Op                          ->   {C,B,D,E} = O
{D,E} = Lp                        ->   {D,E} = L
{F} = Mp                          ->   {F,G} = M
{G} = Np                          ->   {G} = N



Table 6. VC o VC o VC-Binary Granulation and Induced Partition

Center (elementary set of E_VCoVCoVC)  Elementary neighborhood of VC o VC o VC
{B,C} = Op                        ->   {C,B,D,E} = O
{D,E} = Lp                        ->   {D,E} = L
{F} = Mp                          ->   {F,G} = M
{G} = Np                          ->   {G} = N



Higher level concepts derived by part 2 are typically specified by an outside
source. In this paper, the set of higher level concepts will be generated
automatically; only the interpretation of the labels may be provided by an
outside source.

A traditional concept hierarchy is deterministic: every concept is the child
of at most one higher level concept. In such a hierarchy, the base concepts are
grouped into a nested sequence of partitions. A level zero concept is a base
concept (a distinct attribute value). A level one concept is an equivalence class
of the innermost equivalence relation; a level two concept is that of the second
innermost relation, and so on. In contrast, the new approach relaxes the
equivalence relations to general binary relations. The new concept
hierarchy/network is syntactically deterministic, but may be semantically
non-deterministic. In such a hierarchy/network, the base concepts are grouped
into a nested sequence of binary neighborhood systems (binary relations); here
the nesting is in the sense of strong dependency (Section 3.1). Note that this
nested sequence induces a nested sequence of induced equivalence relations. So
we have two nested sequences. Both sequences have the same level zero concepts,
namely the attribute values. At level one, each sequence has its own concepts:
one is an elementary neighborhood of the innermost binary relation, the other is
the corresponding center. We then proceed to the second innermost relation, and
so on. So we have a hierarchy and a network. In the first sequence the siblings
no longer form a partition. Nevertheless, each point (attribute value) still has
a unique elementary neighborhood, so each lower concept is syntactically grouped
into a unique higher level concept. However, a concept may be covered by several
elementary neighborhoods, so semantically it has several "parents."

4.2 Generating Hierarchy for Strong Dependencies

Let $X$, $B$, and $\circ$ be a finite set, a binary relation on $X$, and the
composition of binary relations, respectively.

Propositions

1. If $B$ is reflexive, then

$B \to B^2 = B \circ B \to B^3 = B \circ B \circ B, \ldots, \to B^m$   (*)

is a strongly dependent sequence; see Section 3.1.

2. If $B$ is reflexive and symmetric, then there is a smallest $m$ such that
$B^m$ is an equivalence relation (the transitive closure).

3. The generated sequence (*) is a concept hierarchy/network; see Table 7.

4. (*) induces a nested sequence of induced equivalence relations

$E_B \to E_{B^2} \to E_{B^3}, \ldots, \to E_{B^m}$   (**)

Comments on Table 7

1. Table 7 illustrates the nested sequence (*) for the binary relation VC of
Table 4.

2. Note that syntactically, Table 7 is a hierarchy. For example, O and L are
syntactically independent, but semantically $L \subseteq O$.

3. Also note that C syntactically has a unique parent K (its elementary
neighborhood), but semantically it has another "parent" J (since C is an element
of J). Note that {C,B} = J is the unique elementary neighborhood of B,
not of C; C's unique elementary neighborhood is {C,B,D} = K.

Comments on Table 8

4. Table 8 illustrates the nested induced partitions (**), a hierarchy in the
traditional sense.

5. Table 7 and Table 8 are syntactically two-way isomorphic, but semantically
there is only a one-way map from Table 7 to Table 8. Note that semantically
$L \subseteq O$, but Lp and Op are independent.
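
The sequence (*) and its induced partitions (**) can be generated by repeated composition. The sketch below is a hedged illustration with hypothetical helper names; the starting neighborhoods are those of VC as listed in Table 4, and the printed results can be compared with the columns of Tables 7 and 8.

```python
def compose(outer, inner):
    """Neighborhoods of the composed relation: union of inner(q) for q in outer(p)."""
    return {p: frozenset().union(*(inner[q] for q in N))
            for p, N in outer.items()}


def centers(nbhd):
    """Induced partition: points grouped by identical elementary neighborhood."""
    groups = {}
    for p, N in nbhd.items():
        groups.setdefault(N, []).append(p)
    return sorted(sorted(g) for g in groups.values())


# VC neighborhoods as listed in Table 4 (assumption).
VC = {p: frozenset(s) for p, s in
      {"B": "BC", "C": "BCD", "D": "DE", "E": "DE", "F": "FG", "G": "G"}.items()}

VC2 = compose(VC, VC)    # VC o VC
VC3 = compose(VC2, VC)   # VC o VC o VC

for name, rel in (("VC", VC), ("VC2", VC2), ("VC3", VC3)):
    print(name, {p: "".join(sorted(N)) for p, N in rel.items()}, centers(rel))
```

For this example the sequence stabilizes at the third power: composing once more with VC leaves every neighborhood unchanged.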



5 Mining High Level Data 

5.1 Real World Interpretations of VC-Binary Relation 

1. We interpret the VC-binary relation as "one hour drive." Since some public
roads may be one-way streets, VC is reflexive, but not necessarily
symmetric or transitive.

2. The VC-elementary neighborhood is the "nearest neighborhood." We define
it by the measurement "one hour drive"; see Table 4.

Table 7. Concept Hierarchies/Networks Based on Binary Granulations

CITY    Meaningful name of    Meaningful name of    Meaningful name of
        VC-elementary         VC o VC-elementary    VC o VC o VC-elementary
        neighborhood          neighborhood          neighborhood
B       -> J                  -> K                  -> O
C       -> K                  -> O                  -> O
D       -> L                  -> L                  -> L
E       -> L                  -> L                  -> L
F       -> M                  -> M                  -> M
G       -> N                  -> N                  -> N

Table 8. Concept Hierarchies by Induced Partitions

CITY    Meaningful name of    Meaningful name of    Meaningful name of
        E_VC-elementary set   E_VCoVC-elementary    E_VCoVCoVC-elementary
        (the center)          set (the center)      set (the center)
B       -> Jp                 -> Kp                 -> Op
C       -> Kp                 -> Op                 -> Op
D       -> Lp                 -> Lp                 -> Lp
E       -> Lp                 -> Lp                 -> Lp
F       -> Mp                 -> Mp                 -> Mp
G       -> Np                 -> Np                 -> Np

3. Each VC-elementary neighborhood is assigned a symbol that represents a
meaningful name. For example, L is the name of an "informal city L" (for
example, Greater San Francisco) that includes two distinct cities, D and E
(San Francisco and Berkeley); D and E are the centers of L.

The next items are about the VC o VC-binary relation.

4. The VC o VC-binary relation is "two hours' drive."

5. The VC o VC-nearest neighborhood is the "informal county," which contains
"informal cities." For example, K is the name of an "informal county K" (for
example, Silicon Valley) that includes informal city J (Greater San Jose)
and real city C (Palo Alto); J and C are part of the centers of K.

The next items are about the VC o VC o VC-binary relation.

6. The VC o VC o VC-binary relation is "three hours' drive."

7. The VC o VC o VC-nearest neighborhood is the "informal state," which
contains "informal counties."

Next items are about the binary granulations. 

8. Table 7 is the proposed concept hierarchy/network defined by the nested
binary relations VC, VC o VC, VC o VC o VC.

9. All symbols in this proposed concept hierarchy/network are syntactically
distinct, but semantically they may be related.

The next items are about the induced partitions.

10. Table 8 is the traditional concept hierarchy defined by the nested
equivalence relations $E_{VC}$, $E_{VC \circ VC}$, $E_{VC \circ VC \circ VC}$.

11. All symbols in this traditional concept hierarchy are syntactically and
semantically independent.

12. Each induced equivalence class represents the center cities of the
respective elementary neighborhoods of the nested binary relations.

The next items are about the relationship between granulations and partitions.

13. There is a semantics-preserving map from the concept hierarchies/networks
of Table 7 to Table 8, and an inverse map at the syntactic level.

5.2 High Level Rules 

To mine high level rules, we create the high level tables, Table 9 and Table 10,
from the original supplier table using the new concept hierarchy/network induced
by VC (Table 7) and the traditional concept hierarchy induced by $E_{VC}$
(Table 8). Here we state verbally typical AOG results from the two hierarchies,
namely

Rule 1: From Table 10 we conclude that the suppliers' status is high in the
real city Mp = {F} and low in the real city Op = {B}.

Rule 2: From Table 9 we conclude that the suppliers' status is high in the
informal city M = {F, G} and low in the "informal state" O = {C, B, D, E}.

These two rules are related. Let us interpret them as follows. People live in an
informal area (city, county, state) and work at its center. Rule 1 is about
the pay in centers, while Rule 2 is about the income of individuals residing
in informal areas. So to discuss corporate pay, Table 10 is better; to discuss
family average income, Table 9 is better (a couple may work in different cities
in the same informal area).

6 Conclusion 

In practical databases, attribute values often carry additional semantics specified
by binary relations, such as numerical orders, physical distances, etc. In this
paper, we demonstrate that one can use such semantics to generate a concept
hierarchy/network and to mine high level rules. We believe this paper provides
an initial foundation for a promising new data mining technique.

Table 9. The Supplier Table Granulation Level 2

SNUM    SNAME       STATUS    VC o VC-elementary neighborhood
S1      Smith       TWENTY    K
S2      Jones       TEN       O
S3      Blake       TEN       O
S4      Clark       TWENTY    K
S5      Adams       THIRTY    L
S6      Peterson    FORTY     L
S7      Ewing       EIGHTY    M
S8      Johnson     EIGHTY    M
S9      Pike        FORTY     L
S10     Meyers      NINTY     N


Table 10. The Supplier Table Partition Level 2

SNUM    SNAME       STATUS    E_VCoVC-elementary neighborhood
S1      Smith       TWENTY    Kp
S2      Jones       TEN       Op
S3      Blake       TEN       Op
S4      Clark       TWENTY    Kp
S5      Adams       THIRTY    Lp
S6      Peterson    FORTY     Lp
S7      Ewing       EIGHTY    Mp
S8      Johnson     EIGHTY    Mp
S9      Pike        FORTY     Lp
S10     Meyers      NINTY     Np

Abstract. With the rapid growth in size and number of available databases, the
manipulation of large concept hierarchies that cannot fit in main memory
becomes more and more frequent. Several representations of concept
hierarchies are possible, for example tree, lattice, table, linked list, etc. In this
paper, we propose an efficient implementation technique to manipulate large
concept hierarchies. We use a lattice data structure to represent concept
hierarchies and encode such a lattice into a boolean transitive closure matrix. A
set of lattice operators is defined and implemented as abstract data types on
top of an object-relational database management system, and is used to
perform generalization and specialization operations. We show the efficiency of
the lattice operators in performing generalization and specialization in large
concept hierarchies and compare their performance with the START WITH and
CONNECT BY clauses of SQL.



1 Introduction 

A concept hierarchy defines a sequence of mappings from a set of low-level concepts
to higher-level, more general concepts [1]. Such mappings may organize the set of
concepts in a partial order, such as a tree (a hierarchy, a taxonomy), a
lattice, a directed graph, etc. In a strict concept hierarchy such as a tree or
taxonomy, each concept has exactly one parent (super-concept), whereas in a lattice
or directed graph hierarchy there may be many paths to a particular concept. We
believe that such a lattice or directed graph data structure has advantages over
the tree data structure for representing real world concept hierarchies. In this
paper, we use a lattice data structure to represent concept hierarchies.

Concept hierarchies are very important in data mining and data warehousing. In
data mining, they allow knowledge discovery at different conceptual levels [2]. In
data warehousing, concept hierarchies are necessary for operations such as drill-down
and roll-up on fact dimensions [3]. With the rapid growth in size and number of
available databases, the manipulation of large concept hierarchies that cannot fit
in main memory becomes more and more frequent. In order to improve the efficiency
of a knowledge discovery process, effective implementation techniques to manage
large concept hierarchies have to be proposed.

In a relational database management system, large concept hierarchies are
represented as a collection of tables. Each table has two columns, which respectively
represent a node and its direct parents. Generalization and specialization are then
performed using the START WITH and CONNECT BY clauses of SQL [4].
Operating in a depth-first search manner, such operators are inefficient with respect
to response time in the presence of large concept hierarchies. Furthermore, SQL lacks
essential features for generalization and specialization of n inputs where n > 1.

In this paper we propose a new approach to manipulating large concept hierarchies.
Instead of representing them as a traditional collection of tables, we use a lattice
data structure to represent concept hierarchies. We propose a new mechanism to
manipulate large lattice data structures using efficient implementation techniques.
We use a transitive closure encoding method to represent the lattice in the form of
a boolean transitive closure matrix. This encoding allows generalization and
specialization operations in the hierarchies to be performed in almost constant time.
We propose an efficient method for representing such a matrix in an object-relational
database, in which only the positions of the elements equal to 1 in the matrix are
stored. We show that such a data structure is very efficient in terms of storage
space. Generalization and specialization of concept hierarchies are performed using
the proposed lattice operators, which are UB, LB, Meet and Join. We show their
performance in terms of the efficiency of the generalization and specialization
operations and the ability to handle large concept hierarchies that cannot fit in
memory. The lattice operators have been tested against the NCBI/Genbank molecular
databases [5]. The NCBI database contains relationships between more than 80,000
organisms spanning up to 35 levels. The experimental results show that our new
operators achieve better response times for generalization and specialization
operations than the START WITH and CONNECT BY operators.

The paper is organized as follows: Section 2 gives basic definitions of concept
hierarchy, lattice and its operators. Section 3 describes how concept hierarchies are
represented and queried in an object-relational database using the START WITH and
CONNECT BY clauses of SQL. Section 4 presents how a lattice can be encoded into a
boolean matrix of binary words, and how such a matrix can be efficiently represented
in object-relational databases. Section 5 explains the implementation of the lattice
operators and describes how generalization and specialization can be performed using
them. Section 6 compares the performance of the new lattice operators
and the START WITH-CONNECT BY clauses of SQL in terms of the efficiency of the
generalization and specialization operations against large concept hierarchies. Section
7 concludes the paper and presents future work.

2 Basic Definitions 

The basic notions of concept hierarchy, lattice, and lattice operators are given in
this section. Readers are referred to [6] for further information on this theory.



2.1 Concept Hierarchy 

A concept hierarchy is a sequence of mappings from a set of lower-level concepts to
their higher-level counterparts [1]. Such mappings may organize the set of concepts
in a partial order, in which the most general concept is the null description, whereas
the most specific concepts correspond to the specific values of an attribute in the
database. Let $H$ be the hierarchy defined on a set of domains $D_1, \ldots, D_k$.
Formally we have

$H_1 : D_1 \times \cdots \times D_k \Rightarrow H_2 \Rightarrow \cdots \Rightarrow H_n$

where $H_1$ represents the set of concepts at the primitive level, $H_2$ represents
the concepts one level higher than those at $H_1$, etc., and $H_n$ represents the
highest level of the hierarchy.

2.2 Lattice and Its Operators 

A non-empty partially ordered set $\langle L, \le \rangle$ is a lattice if, for all
$x, y \in L$, the pair $\{x, y\}$ has a least upper bound, denoted $Join(\{x, y\})$,
and a greatest lower bound, denoted $Meet(\{x, y\})$. The lattice operators are given
below. These operators allow generalization and specialization operations to be
performed within a hierarchy.

UB/LB. Let $P$ be a partially ordered set and $S \subseteq P$. An element $m \in P$
is an upper bound of $S$ if $s \le m$ for all $s \in S$. A lower bound is defined
dually. The set of all upper bounds of $S$ is denoted by $UB(S)$ and the set of all
lower bounds by $LB(S)$.

$UB(S) = \{u \in P \mid (\forall s \in S)\ s \le u\}$    $LB(S) = \{l \in P \mid (\forall s \in S)\ s \ge l\}$

Join/Meet. Let $P$ be a partially ordered set and $S \subseteq P$. The join of the
elements in $S$, when it exists, is the least element of the set of common upper
bounds of the elements in $S$. In the same way, the meet of the elements in $S$, when
it exists, is the greatest element of the set of common lower bounds of the elements
in $S$. Hence the join and meet of the elements in $S$ can be determined, respectively,
by

$Join(S) = Min(UB(S))$    $Meet(S) = Max(LB(S))$

The best way to understand how the concept hierarchy can be organized in the 
form of a lattice is to consider an example. In Figure 1, a concept hierarchy of 
organisms A, B, C, D, E, F and G is organized into a lattice, which represents the 
relationships between organisms. 

Once a concept hierarchy is represented using a lattice data structure, 
generalization and specialization of such a hierarchy can be performed using lattice 
operators, which are UB, LB, Meet and Join. According to the given example, we are 
able to answer the following queries: 

- Find all organisms that are more general than G? The answers are {A, B, C, D and
E} by using the UB operator.

- Find all organisms that are more specific than A? The answers are {B, C, D, E, F
and G} by using the LB operator.

- Find the least common super-concept of D and E? The answer is {B} by using the
Join operator.

- Find the greatest common sub-concept of D and E? The answer is {G} by using the
Meet operator.
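
The four operators can be written directly from the definitions above. The following is a minimal sketch, not the implementation described later in the paper: the function names are hypothetical, and the parent links are an assumption read off the relational table in Figure 2 and the matrix in Figure 3.

```python
# Direct parents of each organism (assumed from Figures 2 and 3).
PARENTS = {
    "A": [], "B": ["A"], "C": ["A"], "D": ["B"],
    "E": ["B", "C"], "F": ["D"], "G": ["D", "E"],
}


def ancestors(x):
    """All elements greater than or equal to x (x itself included)."""
    seen, stack = {x}, [x]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def leq(x, y):
    """x <= y in the partial order: y is x or a super-concept of x."""
    return y in ancestors(x)


def ub(S):
    return {u for u in PARENTS if all(leq(s, u) for s in S)}


def lb(S):
    return {l for l in PARENTS if all(leq(l, s) for s in S)}


def join(S):
    upper = ub(S)
    return {u for u in upper if all(leq(u, v) for v in upper)}  # Min(UB(S))


def meet(S):
    lower = lb(S)
    return {l for l in lower if all(leq(v, l) for v in lower)}  # Max(LB(S))


print(ub({"D", "E"}), lb({"D", "E"}), join({"D", "E"}), meet({"D", "E"}))
```

Note that `ub` and `lb` are reflexive here, so a query for elements strictly more general or more specific than x should remove x from the result.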




Fig. 1. Representation of a concept hierarchy using lattice data structure 

3 Querying Concept Hierarchies in Relational Databases

In a relational database management system, we can represent concept
hierarchies as a simple collection of tables. Each table consists of two
columns, which respectively represent a node and its direct parent. The relational
table representing our concept hierarchy example is given in Figure 2. The first
column stores the identifier of a given organism and the second, of the same type,
stores its direct parent(s). Note that each node_id can have more than one
parent_node_id, which is the case when a node has multiple parents. In our
concept hierarchy example in Figure 1, organism E is a direct child of organisms B
and C. Thus, we store the identifiers of its direct parent organisms as E-B in one
row and E-C in another.



Node_id    Parent_node_id
A
B          A
C          A
D          B
E          B
E          C
F          D
G          D
G          E

Fig. 2. Relational database table representing concept hierarchy 



3.1 Generalization and Specialization Operators in Relational Databases 

In a relational database system, the START WITH and CONNECT BY PRIOR
clauses can be used to perform generalization and specialization against concept
hierarchies in a depth-first search manner. The START WITH clause specifies the root
of the hierarchy, while the CONNECT BY PRIOR clause sets up the relationships
between elements in the hierarchy. The LEVEL function gives the distance between the
root and the current node, starting with 1 for the root. The number of levels returned
by a generalization/specialization query may be limited according to the available
resources (main memory).

General syntax of a generalization/specialization query.

    SELECT column1, column2, ...
    FROM tables [variable] ...
    [WHERE condition]
    START WITH condition
    CONNECT BY [PRIOR] column1 = [PRIOR] column2
    [ORDER BY LEVEL]

The CONNECT BY clause fixes the search direction in a hierarchy. Searching for 
elements in the hierarchy can be performed in two directions, which are specialization 
(sub-concept search) and generalization (super-concept search). In the following, 
examples of generalization and specialization queries using the START WITH and 
CONNECT BY clauses are given. 

Query 1: Find all organisms that are more specific than organism A.

    SELECT node_id FROM Table
    START WITH node_id = 'A' CONNECT BY parent_node_id = PRIOR node_id;

Query 2: Find all organisms that are more general than organism D.

    SELECT node_id FROM Table
    START WITH node_id = 'D' CONNECT BY node_id = PRIOR parent_node_id;



4 Encoding Lattice Data Structure 

In this section, we present an efficient implementation technique to represent large
concept hierarchies. This is achieved by embedding the hierarchy into a boolean lattice
matrix of binary words. In [7], methods are described for implementing lattice
operators on type hierarchies in the context of object-oriented
languages. Here, we use a transitive closure encoding method for an effective
implementation of our lattice operators. In the following, we start by presenting how a
concept hierarchy can be encoded into a boolean transitive closure matrix of binary
words. Then, an efficient method for representing such a matrix in an object-relational
database is proposed.



4.1 Boolean Transitive Closure Encoding 

The boolean transitive closure encoding is a mapping of a given lattice into a boolean
lattice matrix of binary words. A 1 in the (i,j)-th element of the matrix means that
element i is an ancestor of element j (or i = j). Row i represents the descendants of
element i. Column j represents the ancestors of element j. The size of the matrix is
n x n, where n is the number of elements in the lattice. We label each row and each
column of the matrix with the name of the corresponding element in the lattice.

To encode the concept hierarchy into a matrix of binary words, we start from the
"immediately greater than" relation covered by the ordering in that hierarchy and
take its transitive closure, which is obtained by computing all descendants of every
element in the concept hierarchy. Therefore each row of the matrix contains 1's only
in those columns headed by elements that are less than or equal to the element heading
the row, and it contains 0's otherwise. Thus, each row is a characteristic
representation of the set of all its sub-concepts and, dually, each column is a
characteristic representation of the set of all its super-concepts. For example, the
boolean transitive closure matrix shown in Figure 3 represents the concept hierarchy
of Figure 1. The row bit string code 0001011 for element D represents the sub-concepts
of D = {D, F and G}. The column bit string code 1101010 for element F represents the
super-concepts of F = {A, B, D and F}. Using these codes, it is possible to compute
lattice operations using only the logical AND operator on bit strings, as we describe
in Section 5.
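
The encoding step can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the names are hypothetical, and the child links are an assumption read off Figures 2 and 3. The two printed codes can be compared with the row code of D and the column code of F quoted above.

```python
ORDER = "ABCDEFG"

# Direct children of each element (assumed from Figures 2 and 3).
CHILDREN = {"A": "BC", "B": "DE", "C": "E", "D": "FG", "E": "G", "F": "", "G": ""}


def descendants(x):
    """All sub-concepts of x, including x itself (reflexive transitive closure)."""
    seen, stack = {x}, [x]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# MATRIX[x][j] is 1 when the j-th element of ORDER is a descendant of x.
MATRIX = {x: [1 if y in descendants(x) else 0 for y in ORDER] for x in ORDER}


def row_code(x):
    """Bit string of row x: the sub-concepts of x."""
    return "".join(str(bit) for bit in MATRIX[x])


def col_code(y):
    """Bit string of column y: the super-concepts of y."""
    j = ORDER.index(y)
    return "".join(str(MATRIX[x][j]) for x in ORDER)


print(row_code("D"), col_code("F"))
```

The diagonal is set to 1 because each element is counted among its own sub-concepts and super-concepts, as in the codes quoted in the text.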



4.2 Boolean Transitive Closure Matrix Representation 

We have seen how a concept hierarchy can be encoded into a boolean transitive
closure matrix of binary words. In the following, we explain how such a matrix can be
efficiently represented in an object-relational database. Given a concept hierarchy of
n elements in which each element is allocated its own code, the concept hierarchy is
encoded with $n^2$ bits. When the matrix is sparse, such a representation is
inefficient with respect to storage space. In order to reduce the storage
space of the boolean transitive closure matrix in an object-relational database, we
propose an efficient representation that stores only the positions of the elements
that are equal to 1.





      A  B  C  D  E  F  G
A     1  1  1  1  1  1  1
B     0  1  0  1  1  1  1
C     0  0  1  0  1  0  1
D     0  0  0  1  0  1  1
E     0  0  0  0  1  0  1
F     0  0  0  0  0  1  0
G     0  0  0  0  0  0  1

Fig. 3. Boolean transitive closure matrix 

If the (i,j)-th element in the boolean transitive closure matrix is equal to 1, then we
store value i in the Rowind column and value j in the Colind column. With this
tabular representation, the storage space of a matrix in an object-relational database
system is greatly reduced. From our example of a boolean transitive closure matrix in
Figure 3, we can transform such a matrix into an object-relational database
table using this technique, as shown in Figure 4. With this representation, we can
retrieve a row or a column of the matrix using simple SQL statements. In the
following algorithms, we describe the steps for retrieving a row and a column of the
boolean transitive closure matrix from the table stored in an object-relational
database.



Rowind    Colind
[one row for each element of the matrix that is equal to 1]

Fig. 4. Table representing boolean transitive closure matrix 



In Algorithm 1 of Figure 5, we propose a method for retrieving a row bit string of the
matrix from an object-relational database table. To retrieve row X of the matrix, we
first select the Colind values from the table where Rowind is equal to X. Then, we set
the bits of the bit string at the returned column positions to 1 and the other
positions to 0.

To retrieve a column bit string of the matrix, similar operations are performed, as
described in Algorithm 2. To get column X of the matrix, we first select the Rowind
values from the table where Colind is equal to X. Then we set the bits of the bit
string at the returned row positions to 1 and the other positions to 0.
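
The position-only storage and the two retrieval steps can be tried out with an in-memory SQLite table. This is a sketch under stated assumptions, not the object-relational implementation used in the paper: the table and column names follow the text, the Python function names are hypothetical, positions are assumed to be 1-based (A = 1, ..., G = 7), and the rows are copied from the matrix in Figure 3.

```python
import sqlite3

# Rows A..G of the boolean transitive closure matrix in Figure 3.
ROWS = ["1111111", "0101111", "0010101", "0001011",
        "0000101", "0000010", "0000001"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matrix_table (rowind INTEGER, colind INTEGER)")
# Store only the (row, column) positions whose matrix element equals 1.
conn.executemany(
    "INSERT INTO matrix_table VALUES (?, ?)",
    [(i + 1, j + 1)
     for i, row in enumerate(ROWS)
     for j, bit in enumerate(row) if bit == "1"],
)


def get_row_code(x, n=len(ROWS)):
    """Rebuild the bit string of row x from the stored column positions."""
    code = ["0"] * n
    query = "SELECT colind FROM matrix_table WHERE rowind = ?"
    for (j,) in conn.execute(query, (x,)):
        code[j - 1] = "1"
    return "".join(code)


def get_col_code(x, n=len(ROWS)):
    """Rebuild the bit string of column x from the stored row positions."""
    code = ["0"] * n
    query = "SELECT rowind FROM matrix_table WHERE colind = ?"
    for (i,) in conn.execute(query, (x,)):
        code[i - 1] = "1"
    return "".join(code)


stored = conn.execute("SELECT COUNT(*) FROM matrix_table").fetchone()[0]
print(stored, get_row_code(4), get_col_code(6))
```

For this small example the table holds one row per 1-entry instead of all 49 matrix cells; the saving grows as the matrix becomes sparser.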



5 Implementation of Lattice Operators 



In this section we discuss how lattice operators can be used to perform generalization
and specialization on a concept hierarchy. Previously, we have seen how a concept
hierarchy can be encoded into a boolean transitive closure matrix and how such a
matrix is represented in an object-relational database. Using these codes, it is possible
to compute lattice operations using a logical AND on bit strings. We use the column bit
strings of the matrix to compute the UB and Join operations and the row bit strings to
compute the LB and Meet operations. From our matrix example in Figure 3, we can
compute the UB, LB, Meet and Join operations for elements D and E as follows:

$UB\{D, E\} = 1101000 \cap 1110100 = 1100000 = \{A, B\}$

$LB\{D, E\} = 0001011 \cap 0000101 = 0000001 = \{G\}$

$Join\{D, E\} = 1101000 \cap 1110100 = 1100000 = \{B\}$

$Meet\{D, E\} = 0001011 \cap 0000101 = 0000001 = \{G\}$



Algorithm 1. Get a row bit string of the matrix

    Procedure GetRowCode(X: element);
    Returns: bit string code
    Begin
        CodeX = 0;
        S = Select Colind
            from matrix_table
            Where Rowind = X;
        For each i in S
            CodeX(i) = 1;
        Return(CodeX);
    End

Algorithm 2. Get a column bit string X of the matrix

    Procedure GetColCode(X: element);
    Returns: bit string code
    Begin
        CodeX = 0;
        S = Select Rowind
            from matrix_table
            Where Colind = X;
        For each i in S
            CodeX(i) = 1;
        Return(CodeX);
    End



Fig. 5. Algorithms for retrieving row (X) and column (X) of a given transitive closure matrix 



Steps for computing the UB and Join operations are given in algorithms 3 and 4 of
figure 6. To perform generalization of elements X and Y, we first retrieve the column
bit string codes of X and Y from the matrix stored in the database using the
GetColCode procedure given in algorithm 2 of figure 5. Then we apply the logical
AND operator to the retrieved codes. All elements whose bit is equal to 1 in the
resulting code are super-concepts of X and Y.

Computation of the Join operation is performed in a similar way; the only
difference is that the answer is the element of the matrix whose code matches the
result of the logical AND operator. If the resulting code is not the code of any
element in the matrix, the minimal common super-concepts of X and Y are returned
instead.

Steps for computing the other lattice operations, LB and Meet, are given
in algorithms 5 and 6 of figure 7. To perform specialization of elements X and Y, we
first retrieve the row bit string codes of X and Y from the matrix stored in the
database using the GetRowCode procedure in algorithm 1 of figure 5. Then we apply the
logical AND operator to the retrieved codes. All elements whose bit is equal to 1 in the
resulting code are sub-concepts of X and Y.

Computation of the Meet operation is performed in a similar way; the only
difference is that the answer is the element of the matrix whose code matches the
result of the logical AND operator. If the resulting code is not the code of
any element in the matrix, the maximal common sub-concepts of X and Y are
returned instead.


Algorithm 3. UB Operator

Procedure UB(X, Y: element);
Returns: element
begin
    While X, Y is not null
    begin
        CodeX = GetColCode(X);
        CodeY = GetColCode(Y);
        Result = CodeX * CodeY;
    end
    for i = 0 to CodeLength
        if Result(i) = 1 then
            return {element(i)};
        end if
    end
end


Algorithm 4. Join Operator

Procedure Join(X, Y: element);
Returns: element
begin
    While X, Y is not null
    begin
        CodeX = GetColCode(X);
        CodeY = GetColCode(Y);
        Result = CodeX * CodeY;
    end
    if ∃ element ∈ Matrix | code(element) = Result then
        return(element)
    else
        return minimum {element | Result(i) = 1};
end



Fig. 6. Algorithms for UB and Join operators 



Algorithm 5. LB Operator

Procedure LB(X, Y: element);
Returns: element
begin
    While X, Y is not null
    begin
        CodeX = GetRowCode(X);
        CodeY = GetRowCode(Y);
        Result = CodeX * CodeY;
    end
    for i = 0 to CodeLength
        if Result(i) = 1 then
            return {element(i)};
        end if
    end
end


Algorithm 6. Meet Operator

Procedure Meet(X, Y: element);
Returns: element
begin
    While X, Y is not null
    begin
        CodeX = GetRowCode(X);
        CodeY = GetRowCode(Y);
        Result = CodeX * CodeY;
    end
    if ∃ element ∈ Matrix | code(element) = Result then
        return(element)
    else
        return maximum {element | Result(i) = 1};
end



Fig. 7. Algorithms for LB and Meet operators 
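The Join and Meet operators can be sketched in Python as follows, using column codes for Join and row codes for Meet as stated at the start of this section. The hierarchy in DESCENDANTS is a hypothetical one, chosen here only because it reproduces the codes of D and E from the worked example. The helper names and the handling of the no-match case (return the minimal or maximal common elements) are one reading of the text, not the authors' implementation.

```python
ELEMENTS = "ABCDEFG"  # assumed element order of the bit strings

# Hypothetical reflexive transitive closure: element -> its sub-concepts.
DESCENDANTS = {"A": "ABCDEFG", "B": "BCDEFG", "C": "CEG", "D": "DFG",
               "E": "EG", "F": "F", "G": "G"}
PAIRS = {(a, d) for a, ds in DESCENDANTS.items() for d in ds}  # (super, sub)

def col_code(x):
    """Column bit string of x: 1 at every super-concept of x (x included)."""
    return "".join("1" if (a, x) in PAIRS else "0" for a in ELEMENTS)

def row_code(x):
    """Row bit string of x: 1 at every sub-concept of x (x included)."""
    return "".join("1" if (x, d) in PAIRS else "0" for d in ELEMENTS)

def bit_and(u, v):
    return "".join("1" if a == b == "1" else "0" for a, b in zip(u, v))

def members(code):
    return {e for e, b in zip(ELEMENTS, code) if b == "1"}

def join(x, y):
    result = bit_and(col_code(x), col_code(y))
    for e in ELEMENTS:
        if col_code(e) == result:      # an element whose code matches the result
            return {e}
    common = members(result)
    # Fallback: common super-concepts with no other common super-concept below them.
    return {e for e in common if not any(c != e and (e, c) in PAIRS for c in common)}

def meet(x, y):
    result = bit_and(row_code(x), row_code(y))
    for e in ELEMENTS:
        if row_code(e) == result:
            return {e}
    common = members(result)
    # Fallback: common sub-concepts with no other common sub-concept above them.
    return {e for e in common if not any(c != e and (c, e) in PAIRS for c in common)}

print(join("D", "E"), meet("D", "E"))  # {'B'} {'G'}
print(meet("F", "G"))                  # set()
```

In this toy hierarchy F and G share no sub-concept, so the AND of their row codes is all zeros and the fallback returns an empty set.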



6 Performance Evaluation 

To assess their performance, the lattice operators were implemented in the C/C++
programming language provided by the Oracle ORDBMS and compared against the
START WITH and CONNECT BY clauses of SQL. The platform we used was a
SUN E3500 running SunOS version 2.5.1 with a CPU clock rate of 366 MHz, 6144 MB
of main memory and a 9 GB disk. Data is stored in Oracle database version 8.1.5 [8].

We ran our experiments using the NCBI/GenBank molecular databases. The NCBI
database contains relationships between more than 80,000 organisms spanning up to
35 levels. Indexes were created on the two columns in order to get the best
performance when using the START WITH and CONNECT BY clauses. Generalization
and specialization with n input parameters, where n > 1, are
not supported by the START WITH and CONNECT BY clauses. In order to compare
these clauses with our lattice operators for n input parameters, we developed
SQL-embedded procedures in PL/SQL to perform this task.

We encoded the NCBI data into a boolean transitive closure matrix and then stored
the matrix using the proposed matrix representation described in subsection 4.2.
The total size of the encoded hierarchy is 15 MB when only the positions of elements
equal to 1 are stored, compared with a total size of 7.5 GB if all elements of the
matrix are stored.

Figures 8, 9, 10 and 11 show the average CPU time (including disk access) of the
generalization and specialization operations, given in microseconds. In order to study
the behavior of the proposed lattice operators and the START WITH-CONNECT BY
clauses on concept hierarchies of various sizes, experiments were conducted on the
database using different numbers of records, ranging from 20,000 to 90,000.
The numbers of input concepts for each operator are chosen randomly from the total
number of existing concepts in the concept hierarchy. Figure 8 shows the average
CPU time used to perform generalization using the UB operator compared with the
START WITH and CONNECT BY clauses for the different data sizes. We
found that the UB operator outperforms the START WITH-CONNECT BY
clauses at every data size. Furthermore, the results show that increasing the number of
records in the database increases the average execution time for
both the new and the existing operators.

Fig. 8. Average CPU time of generalization in various data size

Figure 9 shows the average CPU time used to perform specialization using the LB
operator compared with the START WITH-CONNECT BY clauses for different data sizes.
The results show that the new LB operator is faster
than the START WITH and CONNECT BY clauses. As with the UB operator, increasing
the number of records in the database increases the average execution time.

Fig. 9. Average CPU time of specialization in various data size

Figure 10 shows the average CPU time of the Meet operator compared to the START
WITH-CONNECT BY clauses. For the existing operators, we observe that the larger
the concept hierarchy, the longer the execution time of the search for the greatest
common sub-concept. Furthermore, the execution time of the START WITH-CONNECT
BY clauses increases exponentially as the size of the hierarchy (number of records)
grows. For our Meet operator, we observe that its execution time is almost
instantaneous (less than 6-7 milliseconds). The large difference between the
execution times of the two approaches can be explained by the method used to search for
the greatest common sub-concept in the hierarchy. The START WITH-CONNECT BY
clauses operate in a depth-first manner: starting at the root node, they have to
visit every element in the hierarchy while searching for the
greatest common sub-concept. The new Meet operator, in contrast, applies the logical
AND operator to the boolean transitive closure matrix to perform the same operation
and finds the searched element in almost constant time. The same observation can
be made for the Join operator: its execution time is almost constant (less than 6-7
milliseconds).

Fig. 10. Average CPU time of searching the greatest common sub-concept in various data size

Figure 11 shows the average CPU time of searching for the least common super-concept
using the Join operator and the START WITH-CONNECT BY clauses for
different data sizes.

Fig. 11. Average CPU time of searching the least super-concept in various data size



7 Conclusion 

In this paper, we use a lattice data structure to represent concept hierarchies and
improve the response time of generalization and specialization operations compared
with the START WITH and CONNECT BY clauses of SQL. The concept
hierarchy is encoded into an n x n boolean transitive closure matrix, where n is the
number of elements in the hierarchy. Each element is encoded as a binary word. We
apply only the logical AND operator to these binary words to perform the lattice
operations during generalization and specialization. In order to reduce the
storage space of the underlying n x n boolean transitive closure matrix, we store only
the positions of elements equal to 1. With this representation, we are able to retrieve
a row or column of the matrix using simple SQL statements. The benefit of the new
representation of concept hierarchies is to reduce the time needed to perform
generalization and specialization operations in object-relational databases. This
paper essentially presents an efficient method for manipulating large concept
hierarchies. In the future, we plan to study how the proposed method affects the
efficiency of the mining step of the KDD process on specific data mining systems.



Abstract. In this paper we consider three alternative feature vector
representations of patient health records. The longitudinal (temporal),
irregular character of patient episode history, an integral part of a health
record, provides some challenges in applying data mining techniques.
The present application involves episode history of monitoring services
for elderly patients with diabetes. The application task was to examine
patterns of monitoring services for patients. This was approached by
clustering patients into groups receiving similar patterns of care and
visualising the features devised to highlight interesting patterns of care.



1 Introduction 

We are interested in the problem of clustering individuals given observed data
about the individuals where the observed data does not naturally occur in vector
form. Clustering algorithms are typically applied to data in vector form. For
example, we may have k measurements on a set of patients and so the
measurements on each individual i are represented as a k-dimensional vector. For
vector-form data, well-known and widely-applied clustering techniques can be
applied. Such techniques are generally model-based methods, including mixture
modelling, or distance-based methods [3].

Much real world data is actually in non-vector form, consisting of observations
of an individual, recording information at particular time points. Such
variable-length event sequence data is described in Sect. 2, but examples include
a patient’s usage of medical services and an individual’s stock trading behaviour.
The data is characterised as irregular events where each event may encapsulate
a different type of action.

The data mining practitioner wishing to cluster event sequence data appears
to have three options. The first option is to convert the event sequence data into
feature vectors [4]. A problem with this approach is that information is inevitably
lost in the vectorisation process. The second option is to use a distance-based
clustering method which allows for non-vector data. An edit-distance metric [5],
which uses insert, delete and replace operations to turn one sequence into another,
is an example of this approach. A difficulty here is in defining an effective distance
metric. A suitable distance metric needs to be created for each new application.
The third option is the use of mixtures of a generative probabilistic model [2].
This is an attractive approach but not further explored here.

We chose the first option for the application described in this paper. An aim
was to minimise the loss of information relevant to the data mining objectives
in choosing the feature vectors. We present three alternative feature vectors for
representing medical event sequence data. Our exploration provides insights into
the process of developing alternative feature sets. We identify feature sets that
are useful for clustering event sequence data.

Sect. 2 describes the patient health record data and Sect. 3 describes the
objectives for investigating patterns of care received by patients. Sect. 4 describes
the feature vectors we have used in looking for patterns of care. To the best of
our knowledge two of the three feature vectors we use here are novel. Clustering
results and their visualisations are presented in Sect. 5.

2 Health Care Data 

Medicare is the Australian Government’s universal health care system. Each 
visit to a medical practitioner or hospital is covered by Medicare and recorded 
as a transaction in the Medicare Benefits Scheme (MBS) database. This data 
has been collected in Australia since the inception of Medicare in 1975. Such a 
massive collection of data provides an extremely rich resource that has not been 
fully utilised in the exploration of health care delivery in Australia. 

For this current exploration we use a subset of de-identified data (to pro- 
tect privacy) based on Medicare transactions from Western Australia (WA) for 
the period 1994 to 1998. Our particular focus is on patterns of care related to 
diabetes for elderly patients (over 65 years of age). We have only limited demo- 
graphic information about each patient, such as age, gender and location. For 
each patient we also have the sequence of diabetes-related monitoring tests they 
have received over this time interval. 

The four monitoring tests included in our dataset are given in Table 1.
Glycated hemoglobin measurements (Gl) provide information about the
accumulated effect of glucose levels. Ophthalmologic examinations (Op) are
important in the early identification of complications related to eyesight. Cholesterol
measurements via lipid studies (Ch) help identify possible complications
relating to heart conditions. Microalbuminuria tests (Al) provide early indications of
possible future kidney function loss.

Table 1. Types of services received by patients and indicative guidelines.

Abbrev  Description                                 Guidelines
Gl      Quantitation of glycosylated hemoglobin.    2-4 times per year
Op      Ophthalmologic examination.                 Every 1-2 years
Ch      Cholesterol measurement via lipid studies.  Every year
Al      Microalbuminuria test.                      Every year

A sample patient record is illustrated in Fig. 1. The event sequence data can
be augmented with any available vector-based data.

[Figure: timeline from T_f to T_l (1994 to 1998) showing the Al, Ch, Op and Gl events of one patient]

Fig. 1. A sample patient’s health record, showing the four types of tests received over
five years. The tests are: glycated hemoglobin (Gl); ophthalmology (Op); cholesterol
(Ch); and micro-albuminuria (Al).



3 Patterns of Care in the Management of Diabetes 

An important area in health population research is the investigation of patterns 
of care received by patients. Are there distinct patterns of care for these diabetes 
patients? Are there groups of patients receiving similar patterns of care? Are the 
patterns of care related to their doctor? Do patients of different age or gender 
or location receive differing patterns of care to other patients? 

We have some prior expectations about the desired patterns of care for
elderly patients with diabetes. The Australian National Health and Medical
Research Council (NHMRC) publishes clinical guidelines for looking after patients.
Patients with diabetes are at risk of developing complications such as eye
problems, loss of kidney function and circulatory problems. The clinical guidelines
recommend that monitoring services, such as those in Table 1, be carried out at
certain regular intervals. There is no compulsion for general practitioners to adhere
to these guidelines and the guidelines cannot be expected to be appropriate for
everyone.

To complicate matters, published clinical guidelines can differ in their details 
from state to state, and from country to country. We use the NHMRC guidelines 
as our starting point, but refer to other guidelines where they differ and where 
they may have an effect on clinical practice in Western Australia. 

For example, according to NHMRC guidelines, glycated hemoglobin
measurement should be done every six months (or every four months for some guidelines).
Ophthalmologic examinations should be done every two years (or annually for
some guidelines). The cholesterol measurement via lipid studies should be
performed once a year. The microalbuminuria test should be done annually.



4 Selecting Features 

We now present three methods for mapping the non-vector sequence data onto
feature vectors. The first is the obvious count approach of having one feature for
each type of service representing the number of times the service was used. The
other two methods, which we call average-residual-deviance and the gap, are less
obvious and overcome some shortcomings of the count method.

4.1 Count

In the count feature vector approach we have one feature for each type of service.
Each feature contains the number of services received. The original sequence
data, as shown in Fig. 1, is mapped to the features shown in Table 2. This
feature representation has the advantage of being easily interpretable. However,
the obvious loss of information is a concern for the goals of our project. We have
lost information relating to the time between successive services and also to the
overall coverage of the services across the five years. For example, an individual
with a count of 15 for a service appears to be well-monitored, but if those 15
services all occurred in 1994 and none occurred in 1995, 1996, 1997 and 1998,
then that is a pattern we would like to identify.

Table 2. Count Feature Vectors.

Patient  Gl  Op  Ch  Al
1         4   5   3   1
2         5   0   0   0
2         1   1   1   1
3        16  20  17  17
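A minimal Python sketch of the count mapping is given below. The event lists are hypothetical: the first has the same counts as patient 1 in Table 2, and the last two illustrate how an evenly spread history and a history bunched into 1994 collapse to the same count vector.

```python
from collections import Counter

SERVICES = ["Gl", "Op", "Ch", "Al"]  # feature order used in Table 2

def count_features(events):
    """Map a list of (year, service) events to one count per service type."""
    counts = Counter(service for _, service in events)
    return [counts.get(s, 0) for s in SERVICES]

# Hypothetical event sequence with the same counts as patient 1 in Table 2.
patient_1 = [(1994, "Op"), (1994, "Gl"), (1995, "Al"), (1995, "Ch"), (1995, "Gl"),
             (1996, "Op"), (1996, "Op"), (1996, "Ch"), (1996, "Gl"),
             (1997, "Op"), (1997, "Op"), (1997, "Ch"), (1998, "Gl")]
print(count_features(patient_1))  # [4, 5, 3, 1]

# Timing is lost: one test per year and five tests in 1994 give the same vector.
spread = [(year, "Gl") for year in (1994, 1995, 1996, 1997, 1998)]
bunched = [(1994, "Gl")] * 5
print(count_features(spread) == count_features(bunched))  # True
```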



4.2 Average, Residual, Deviance 

We have devised the average, residual, deviance feature vector for capturing the
required temporal information missed by the count feature vector approach.

Let $t^k_{i,j}$ ($i = 1, 2, \ldots, n_j$) be the date of the $i$th service for service type $j$ on
patient $k$ and $n_j$ be the total number of type $j$ services received by patient
$k$. Define $T_f$ and $T_l$ to be the beginning and ending dates of the time interval
covered by the study.



Definition 1 (Mean Interval). The Mean Interval, $MI_{j,k}$, for the patient $k$
on service $j$ when $n_j > 0$ is defined as:

\[ MI_{j,k} = \frac{\sum_{i=1}^{n_j - 1} \left( t^k_{i+1,j} - t^k_{i,j} \right)}{n_j - 1} \tag{1} \]

If $n_j = 0$ then $MI_{j,k} = T_l - T_f$.



For example, we can calculate the Mean Interval for a patient who had
thirteen tests for Quantitation of glycosylated hemoglobin on the following dates:

9044, 9272, 9377, 9527, 9592, 9766, 9875, 9985, 10101, 10154, 10334, 10413, 10510   (2)

where, for computational convenience, these dates are expressed as the number
of days since January 1st, 1970. The intervals (in days) between consecutive
tests are then

228, 105, 150, 65, 174, 109, 110, 116, 53, 180, 79, 97   (3)

For this patient we have $MI_{j,k} = \frac{228 + 105 + 150 + 65 + \cdots + 116 + 53 + 180 + 79 + 97}{12} = 122.2$.

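Definition 2 (Deviation Interval). The Deviation Interval, $DI_{j,k}$, for patient
$k$ on service $j$ for $n_j > 0$ is defined as:

\[ DI_{j,k} = \sqrt{\frac{\sum_{i=1}^{n_j - 1} \left( t^k_{i+1,j} - t^k_{i,j} - MI_{j,k} \right)^2}{n_j - 1}} \tag{4} \]

If $n_j = 0$ then $DI_{j,k} = T_l - T_f$.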

For example, using the patient with the health record for a single test given
in Eqn. (2), the Deviation Interval is

\[ DI_{j,k} = \sqrt{\frac{(228 - 122.2)^2 + (105 - 122.2)^2 + (150 - 122.2)^2 + \cdots}{n_j - 1}} = 47.9 \tag{5} \]



Definition 3 (Residual Time). The Residual Time, $RT_{j,k}$, for patient $k$ on
service $j$ is defined as:

\[ RT_{j,k} = t^k_{1,j} - T_f + T_l - t^k_{n_j,j} \tag{6} \]

For example, using the patient with the health record given in Eqn. (2), we
have $T_f = 8765$ (January 1st, 1994) and $T_l = 10592$ (December 31st, 1998), so
that the Residual Time is $RT_{j,k} = 9044 - 8765 + 10592 - 10510 = 361$.
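The three features can be sketched in Python as below, using the thirteen test dates of Eqn. (2) and the window bounds quoted in the text. The function names are illustrative. Two choices are assumptions rather than part of the definitions: a patient with a single service has no interval, so the mean and deviation fall back to the window length as for $n_j = 0$, and a patient with no services is given the whole window as residual time. The deviation uses the $n_j - 1$ denominator and is included for completeness, so only the mean and residual values are printed.

```python
from math import sqrt

T_F, T_L = 8765, 10592  # window bounds quoted in the text, days since 1 Jan 1970

def intervals(dates):
    """Days between consecutive services."""
    return [later - earlier for earlier, later in zip(dates, dates[1:])]

def mean_interval(dates):
    gaps = intervals(dates)
    # Assumption: with fewer than two services there is no interval, so fall
    # back to the window length, as the definition does for n_j = 0.
    return sum(gaps) / len(gaps) if gaps else T_L - T_F

def deviation_interval(dates):
    gaps = intervals(dates)
    if not gaps:
        return T_L - T_F
    mi = mean_interval(dates)
    # len(gaps) equals n_j - 1, the denominator used in the definition.
    return sqrt(sum((g - mi) ** 2 for g in gaps) / len(gaps))

def residual_time(dates):
    # Assumption: a patient with no services gets the whole window as residual.
    if not dates:
        return T_L - T_F
    return (dates[0] - T_F) + (T_L - dates[-1])

gl_dates = [9044, 9272, 9377, 9527, 9592, 9766, 9875, 9985,
            10101, 10154, 10334, 10413, 10510]
print(intervals(gl_dates)[:4])            # [228, 105, 150, 65]
print(round(mean_interval(gl_dates), 1))  # 122.2
print(residual_time(gl_dates))            # 361
```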

The feature Mean Interval measures the average interval between receiving
the same service. The Deviation Interval measures whether the service intervals
are regular or irregular. The third feature provides a way of accounting for the
windowing effects of having data for 5 years only. The time intervals from the
window boundary to the time of the first service and from the last service to
the window boundary are not considered in the definitions of the first two
features. The third feature is used to account for these boundary effects.

The feature vector for a service should have reasonably small values for all 
three features if the patient is treated according to the clinical guidelines. Typ- 
ically, some patients do need more frequent services as their diabetic condition 
is serious. We do not consider the possibility of over-servicing by medical prac- 
titioners, where more services than are clinically necessary are provided. 

Patterns of care contrary to the clinical guidelines can arise from insufficient 
numbers of services provided over the five years. This type of pattern is detected 
by the count feature vector and by a large Mean Interval value. The average, 
residual, deviance feature vector also represents patterns of care where the ser- 
vices provided are clustered in time, or are absent near the boundaries of the 
time window. 

The features are still relatively easy to interpret. However, we now need
twelve features in our present application instead of the four for the count feature
vector.

4.3 Gap



Our third feature vector representation is the most specific to the task of in- 
vestigating service patterns with reference to service clinical guidelines. The 
motivation is to describe the total length of time when the regular required tests 
are not carried out. 

Once again, let $t^k_{i,j}$ ($i = 1, 2, \ldots, n_j$) be the date of the $i$th service for service
type $j$ on patient $k$ and $n_j$ be the total number of type $j$ services received by
patient $k$. Define $T_f$ and $T_l$ to be the beginning and ending dates of the time
interval covered by the study.

We require that service type $j$ have a desirable gap, $DG_j$, as given by some
clinical guidelines.

Definition 4 (Gap). If patient $k$ has $n_j = 0$ (the patient has no services) then
the gap, $G_{j,k}$, is defined as:

\[ G_{j,k} = T_l - T_f - DG_j \tag{7} \]

If patient $k$ has $n_j > 0$ (the patient has one or more services):

\[ G^{initial}_{j,k} = \begin{cases} t^k_{1,j} - T_f - DG_j & \text{if } t^k_{1,j} - T_f > DG_j \\ 0 & \text{otherwise} \end{cases} \tag{8} \]

The following counts the time intervals between services received:

\[ G^{i}_{j,k} = \begin{cases} t^k_{i+1,j} - t^k_{i,j} - DG_j & \text{if } t^k_{i+1,j} - t^k_{i,j} > DG_j \\ 0 & \text{otherwise} \end{cases} \tag{9} \]

We then include the final service interval:

\[ G^{final}_{j,k} = \begin{cases} T_l - t^k_{n_j,j} - DG_j & \text{if } T_l - t^k_{n_j,j} > DG_j \\ 0 & \text{otherwise} \end{cases} \tag{10} \]

The three Gap sub-parts are now summed:

\[ G_{j,k} = G^{initial}_{j,k} + \sum_{i=1}^{n_j - 1} G^{i}_{j,k} + G^{final}_{j,k} \tag{11} \]



For example, consider the patient with the health record given in Eqn. (2), and
assume that $DG_1 = 120$ for the Quantitation of glycosylated hemoglobin
test ($j = 1$). As shown previously, $T_f = 8765$ and $T_l = 10592$. The Gap between
$T_f$ and the first test is 9044 - 8765 = 279, which is greater than $DG_1$, so it
contributes 279 - 120 = 159 to the sum. There are four time intervals exceeding
$DG_1$ among the 12 time intervals. They are 228, 150, 174 and 180 days
respectively. Their contribution to the sum is 108, 30, 54 and 60 days respectively. The
last test was done on day 10510 and the gap between $T_l$ and the last test is
10592 - 10510 = 82, which is less than $DG_1$ and therefore contributes nothing
to the sum. Therefore we have $G_{1,k} = 159 + 108 + 30 + 54 + 60 = 411$.
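Definition 4 can be sketched as a short Python function. The function and parameter names are illustrative, and the example dates are a small hypothetical history rather than data from the study: a window from day 0 to day 500, a desirable gap of 120 days and services on days 100, 150 and 400.

```python
def gap(dates, t_f, t_l, dg):
    """Total time by which the intervals without a service exceed the desirable gap dg."""
    if not dates:
        # No services at all: the whole window minus one desirable gap.
        return t_l - t_f - dg

    def excess(length):
        # An interval contributes only the part that exceeds dg.
        return length - dg if length > dg else 0

    total = excess(dates[0] - t_f)                                           # initial part
    total += sum(excess(later - earlier)
                 for earlier, later in zip(dates, dates[1:]))                # between services
    total += excess(t_l - dates[-1])                                         # final part
    return total

# Hypothetical history: only the 250-day interval between days 150 and 400
# exceeds the desirable gap, by 130 days.
print(gap([100, 150, 400], t_f=0, t_l=500, dg=120))  # 130
print(gap([], t_f=0, t_l=500, dg=120))               # 380
```

A value of zero means that no interval in the window, including the two boundary intervals, was longer than the desirable gap.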

The advantage of this feature vector is that it is low-dimensional and easy
to interpret. This feature is particularly useful when there is an expectation of
regularity in the events and this regularity is to be explored.

5 Results

We used a model-based clustering program called Snob [7, 8], which uses a Bayesian
mixture-modelling method, with a Poisson distribution for the count feature
vectors and a log-normal distribution for the average, residual, deviance feature
vectors.



5.1 Clustering Using Count 

Fig. 2 gives the means and membership size of the 23 clusters found using Poisson
mixture models. The Poisson distribution was suitable for these features because
the counts are non-negative integers. Note that the counts are only approximately
Poisson, because very large counts of services do not occur at all in practice.
We now interpret these clusters. First recall that to receive care conforming to
clinical guidelines for the Gl test over five years you would need between 10
and 15 tests. The population mean is 5 tests. It is apparent from Fig. 2 that
most individuals do not receive conforming care because the large membership
clusters (e.g., 4, 5, 6, 7) have mean counts below 6. Only two clusters (2
and 20) have means of 10 or more. Cluster 2 individuals do not conform on the
other two tests because they have means less than 3 for Op and Ch. In contrast,
Cluster 20 individuals receive better than conforming care for all three tests.
The 70 individuals in that group are apparently better looked after than all the
others. The next best groups for all three tests are clusters 15, 16 and 17.

In follow-up work we plan to examine the characteristics (e.g., number of GP
consultations, whether they are in the community or a nursing home, number of
days in hospital) of individuals in the very ‘good’ and very ‘bad’ clusters to see
if they can be distinguished from those receiving other patterns of care.



5.2 Clustering Using Mean, Residual, Deviance 

Fig. 3 gives the means and standard deviations of the 17 clusters found using
log-normal mixture models. The log-normal distribution was suitable for these
features because the mean, residual and standard deviation have positive
continuous values. A mean interval of six months or so is indicative of care conforming
to clinical guidelines for the Gl test. Looking at Fig. 3 we see that clusters 3,
6, 7, 8, 9, 10, 15 and 17 have mean intervals less than 10. Cluster 17 has less
than 10 members and so we ignore it for the moment. Individuals in clusters
8 and 9 receive the best patterns of care for this population. The 700 cluster
3 individuals receive regular conforming Cl tests, but very infrequent Op tests.
In follow-up work we hope to characterise these individuals further. It may be
possible to devise a policy to improve their quality of ophthalmology care.
Opposite to cluster 3, clusters 6, 7 and 9 receive frequent Op tests, but infrequent
Cl tests.

At the other end of the quality of care, the 850 individuals in cluster 16
receive less care than the other individuals in the population.

Fig. 3. Clustering based on average, residual, deviance features. Residual and
Deviance features were used in the clustering but are not shown.

We now compare the clustering results from Fig. 3 with those from Fig. 2
using a confusion matrix. The ith row of the confusion matrix contains the
members of cluster i using Count. The individuals of cluster i are placed in column
j if they belong to cluster j using Average, Residual, Deviance. If the two
clustering approaches were identical, then one would expect the confusion matrix to
contain one non-zero entry in each row and column. If the two clustering
methods are independent, then one would expect a relatively uniform distribution of
non-zero entries.
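The construction of such a confusion matrix can be sketched in Python as follows. The two label lists are hypothetical cluster assignments for nine individuals, used only to show the mechanics; they are not results from the study.

```python
from collections import Counter

def confusion_matrix(labels_a, labels_b):
    """Cross-tabulate two cluster assignments of the same individuals.

    Entry [r][c] is the number of individuals placed in the r-th cluster of the
    first clustering and the c-th cluster of the second.
    """
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]

# Hypothetical cluster labels for nine individuals under the two feature vectors.
count_clusters = [1, 1, 1, 2, 2, 2, 2, 3, 3]
ard_clusters = [1, 1, 4, 2, 2, 3, 3, 5, 5]

rows, cols, matrix = confusion_matrix(count_clusters, ard_clusters)
for r, row in zip(rows, matrix):
    print(r, row)
# 1 [2, 0, 0, 1, 0]
# 2 [0, 2, 2, 0, 0]
# 3 [0, 0, 0, 0, 2]
```

In this toy case each Count cluster spreads over at most two of the other clusters and most entries are zero, which is the kind of related-but-not-identical structure the comparison looks for.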

The confusion matrix is shown in Table 3. We see that there are indeed
many zero entries indicating that the two clustering approaches result in related,
but not identical, results. Intuitively, we would expect this if very high residual
values are rare, thus making the count feature values highly correlated with
the average feature values. The most interesting feature is that we consistently
observe that individuals from a count cluster are distributed among one, two or
three average, residual, deviance clusters. On closer examination, the average,
residual, deviance clusters have similar mean average and deviance values, but
differ in their residual value. This shows the value of using the residual feature
to identify intensive patterns of care during a relatively short time interval.

Table 3. Confusion Matrix. The rows show individuals from a cluster using Count
distributed among the clusters using Average, Residual, Deviance. NB: Clusters 16-23
for Count and clusters 16-17 for Average, Residual, Deviance have been omitted.

5.3 Visualising the Gap

Fig. 4 provides a visualisation of the Gap for three different desired clinical
guideline intervals (DG). The second of the three visualisations in Fig. 4
sets DG = 6 months. The distinct mode at zero indicates good conformance
with the guidelines. In all three visualisations there is a mode around 20-24
months, worthy of further investigation: is there some structural feature in the
health system that has patients receiving this test every two years, rather than
not at all (in the worst case)? In general we note a peak at Gap = 0 and another
peak at the other end of the scale. These two peaks represent extremes: the
first peak corresponds to conformance while the other peak corresponds to
non-conformance to the guidelines.

Fig. 4. Visualisation of Gl Gap with DG = 3, 6 and 12 months from left to right.

Fig. 5 and Fig. 6 provide a visualisation of the Op and Ch tests for two
different published clinical guideline intervals. At the right extreme are the individuals
who did not receive any tests and so do not conform to the guidelines. At the left
extreme are those individuals who conform to the guidelines. In between we see
how patterns of care slowly degrade in terms of conformity. Note the mode at 12
months on the left-hand side panels (where DG = 12 months). This is indicative
of the population of individuals receiving a precisely conforming pattern of care.

Fig. 5. Visualisation of Op Gap with DG = 12 and 24 months from left to right.

Fig. 6. Visualisation of Ch Gap with DG = 12 and 24 months from left to right.



6 Conclusion 

We have considered three alternative feature vectors for representing
variable-length patient health records. The feature vector of counts is the simplest, but
can be misleading since it does not capture the distribution of patient care
throughout the data window. The average, residual, deviance feature vector
overcomes this problem. For the specific task of characterising relationships to clinical
guidelines, the gap feature vector most directly represents the required
information. We expect the features created here for event sequence data for this health
application will be applicable to other event sequence data such as trading and
web log data.

References 

[1] E. Arjas, H. Mannila, M. Salmenkivi, R. Suramo, and H. Toivonen. Bass: Bayesian
analyzer of event sequences. In Proceedings in Computational Statistics
(COMPSTAT’96), pages 199-204. Barcelona, Spain, Physica-Verlag, 1996.

[2] I. Cadez and P. Smyth. Probabilistic clustering using hierarchical models. Technical
Report 99-16, Department of Information and Computer Science, University of
California, Irvine, March 1999.

[3] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood
Cliffs, NJ, 1988.

[4] Huan Liu and Hiroshi Motoda. Feature Selection for Knowledge Discovery and Data
Mining. Kluwer, 1998.

[5] P. Moen. Attribute, Event Sequence, and Event Type Similarity Notions for Data
Mining. PhD thesis, Dept. of Computer Science, University of Helsinki, Finland,
2000.

[6] J.J. Oliver, R.A. Baxter, and C.S. Wallace. Unsupervised Learning using MML. In
Machine Learning: Proceedings of the Thirteenth International Conference (ICML
96), pages 364-372. Morgan Kaufmann Publishers, San Francisco, CA, 1996.

[7] C.S. Wallace and D.M. Boulton. An information measure for classification.
Computer Journal, 11(2):195-209, 1968.

[8] C.S. Wallace and D.L. Dowe. MML clustering of multi-state, Poisson, von Mises
circular and Gaussian distributions. Statistics and Computing, 10:73-83, 2000.






Boosting the Performance of Nearest Neighbour 
Methods with Feature Selection 



Shlomo Geva 

Smart Devices Laboratory 
Queensland University of Technology 
GPO Box 2434 
Brisbane Q 4001 
Australia 

s.geva@qut.edu.au 



Abstract. This paper describes a Nearest Neighbour procedure for variable 
selection in function approximation, pattern classification, and time series 
prediction. Given a training set of input/output vector pairs the procedure 
identifies a subset of input vector components that effectively capture the 
input-output relationship implicit in the training set. The utility of this 
procedure is demonstrated with numerous data sets from the UCI repository of 
machine learning databases and the Mackey-Glass time series prediction. A 
comprehensive set of benchmark problems is used to demonstrate comparable 
performance to that of much more complex boosted C4.5 decision trees. 



1 Introduction 

Subset selection is a special case of feature extraction. Given a training set of 
input/output vector pairs, one assumes that the observed output is a function of some 
subset (not necessarily unique) of input vector components. The objective is then to 
identify such a subset. A related problem, which was more extensively studied in 
relation to pattern classification problems, involves the weighting of input variables. 
Each variable is assigned a weight so as to improve classification accuracy. For 
instance, with methods that are based on distance metric, such as Nearest Neighbour 
classifiers or Radial Basis Functions networks, it is often useful to apply a 
transformation to the input vectors in distance calculations. The Mahalanobis
distance ($D_\psi$) supports an arbitrary linear transformation of the input vectors:

$$D_\psi(\mathbf{x},\mathbf{y}) = \|\mathbf{x}-\mathbf{y}\|_\psi = \sqrt{(\mathbf{x}-\mathbf{y})^T \psi\, (\mathbf{x}-\mathbf{y})}$$

A diagonal matrix $\psi$ is often useful when the components of the input vector are
independent. The use of $D_\psi$ can sometimes produce better results than the use of
Euclidean distance, which is a special case of $D_\psi$ with $\psi = I$. In practice, however, it
is usually difficult to determine $\psi$ on the basis of a priori knowledge, and therefore
this is done implicitly by an adaptive optimisation procedure.
Improvements in classification accuracy by several weight selection procedures
were previously reported [2, 6, 9].
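The diagonal case can be illustrated with a minimal sketch. The function name `weighted_distance` and the sample vectors are ours, chosen only for illustration; the diagonal of $\psi$ is passed as a plain list of weights.

```python
import math

def weighted_distance(x, y, weights):
    """sqrt((x-y)^T psi (x-y)) for a diagonal psi given as a list of weights."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
euclid = weighted_distance(x, y, [1.0, 1.0, 1.0])  # psi = I gives Euclidean distance
masked = weighted_distance(x, y, [1.0, 0.0, 1.0])  # zero weight ignores component 2
```

With weights restricted to 0 and 1, choosing the weights amounts to choosing a subset of components, which is the setting studied in the rest of the paper.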

The Wrapper method [5] was also studied in the context of pattern classification. 
In that approach variables are assigned weights which are optimised through a search 
in weight space (i.e., $D_\psi$ is used and a diagonal $\psi$ is determined empirically). The
fitness of a particular set of weights is measured by the accuracy that is achieved by
an underlying induction algorithm. The subset selection is 'wrapped' around the
particular induction method, hence the name Wrapper. A comprehensive coverage of
feature selection approaches can be found in [1].

A different subset selection problem arises in relation to time series prediction. A
frequently used method for subset selection is based on Takens' Theorem [10]. A
subset of time-delayed co-ordinates is used, where d is the embedding dimension
and $\tau$ is the delay:

$$X = \{x(t), x(t-\tau), x(t-2\tau), \ldots, x(t-(d-1)\tau)\}$$
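As a small illustration of time-delayed co-ordinates, the sketch below builds the delay vectors for a toy series. The helper name `delay_embed` and the toy input are assumptions made for this example only.

```python
def delay_embed(series, d, tau):
    """Return [x(t), x(t-tau), ..., x(t-(d-1)tau)] for every t with a full history."""
    start = (d - 1) * tau
    return [[series[t - k * tau] for k in range(d)]
            for t in range(start, len(series))]

# Toy series 0..9 with embedding dimension 3 and delay 2.
vectors = delay_embed(list(range(10)), d=3, tau=2)
```

The first vector is built at t = 4, the earliest time for which all three delayed values exist.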

This approach assumes that the future dynamical behaviour of the whole system is 
largely dependent upon the time series itself, up to the present time; this is imperative 
if the procedure is to be effective. Reference [14] describes the False Nearest
Neighbour method for discovering an appropriate embedding
dimension for phase space reconstruction. In this paper we describe a related, but
very different, method for determining a suitable subset of attributes that is
based on nearest neighbour analysis of the data. The method is applicable to
classification as well as to function approximation and time series prediction
problems.



2 Estimating the Nearest Neighbour Error 

The quantity that we will be looking to minimise, in searching for an appropriate 
subset of variables, is the Nearest Neighbour leave-one-out error E. It is closely 
related to the often-used leave-one-out cross validation error measure. Here we define 
it a little more broadly than usual to also cover function approximation tasks with 
vector, rather than scalar, output. The error is calculated as follows: 

1. Leave out one of the input/output pairs in the training set, say $\{\mathbf{x}_i, \mathbf{y}_i\}$.

2. Find $\mathbf{x}_n$, the nearest neighbour vector of $\mathbf{x}_i$ in the training set. The
approximation error of $\mathbf{y}_i$ is now defined as

$$e_i = (\mathbf{y}_i - \mathbf{y}_n)^T (\mathbf{y}_i - \mathbf{y}_n)$$

where $e_i$ is the error of approximating $\mathbf{y}_i$ at $\mathbf{x}_i$ by the function value
$\mathbf{y}_n$ observed at $\mathbf{x}_n$.

3. Repeat steps 1-2, leaving out each of the N training set examples in turn, to
accumulate the global quantity E:

$$E = \sum_{i=1}^{N} e_i$$



The error E is an approximation of the nearest neighbour classification, or function
approximation, error of the data set. E has a useful property that we can exploit in
subset selection. Suppose that we have a training set derived with the actual subspace
of M significant components in $R^M$. We compute the error $E^M$ in this subspace. Now
consider the addition of an irrelevant component to the input vectors, so that they lie
in $R^{M+1}$. Computing the error $E^{M+1}$, we observe a desirable property of E: one would
generally expect to find that $E^{M+1} \ge E^M$. This is because the irrelevant component will
cause the neighbourhood relationship among the vectors in $R^{M+1}$ to be different to that
in $R^M$. This means in some cases that the approximated function values will not be
derived from the nearest neighbours in $R^M$ but rather from other more distant vectors.
On average we would expect the approximation error to increase. Even in the
presence of noisy training examples this should hold on average, as confirmed by our
experiments.
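The computation of E can be sketched in a few lines. This is a brute-force illustration, not the author's implementation: the function name `nn_loo_error`, the optional `subset` argument, and the four-point toy data set are assumptions. In the toy data the output depends only on the first component, and the second component is irrelevant.

```python
def nn_loo_error(X, Y, subset=None):
    """Sum over i of ||y_i - y_n||^2, where x_n is the nearest neighbour of x_i
    among the other training vectors, measured on the chosen components."""
    cols = list(range(len(X[0]))) if subset is None else list(subset)
    total = 0.0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if j == i:
                continue  # leave example i out
            d = sum((xi[c] - xj[c]) ** 2 for c in cols)
            if d < best_d:
                best_j, best_d = j, d
        total += sum((a - b) ** 2 for a, b in zip(Y[i], Y[best_j]))
    return total

X = [[0.0, 0.9], [0.1, 0.0], [0.5, 0.8], [0.6, 0.1]]
Y = [[0.0], [0.0], [1.0], [1.0]]
error_relevant_only = nn_loo_error(X, Y, subset=[0])  # first component only
error_with_irrelevant = nn_loo_error(X, Y)            # both components
```

On this toy set the irrelevant component changes every nearest neighbour, so the error rises from 0 to 4, in line with the expectation that $E^{M+1} \ge E^M$.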



3 Subset Selection 

Two different approaches are commonly used in subset selection: Backward
Sequential Selection (BSS) and Forward Sequential Selection (FSS) [1]. The BSS procedure is
somewhat more exhaustive and slower but sometimes produces more reliable results.



3.1 Backward Sequential Selection (BSS) 

Starting from a training set with input vectors in $R^M$ we compute $E^{M-1}$ in all M
subspaces $R^{M-1}$, each of which has one of the original M components left out.
Following from our argument in the preceding section, we expect that when an
irrelevant component is dropped we will find that $E^{M-1} \le E^M$. It is possible that more
than one of the components can be left out with that result. We eliminate that
component which when dropped leaves a subspace having the lowest value of $E^{M-1}$.
The process is then repeated in the selected subspace until a stopping criterion is met.
Our experiments reveal that the error can sometimes increase when a component is
eliminated, only to decrease again as more components are eliminated.
Therefore the process is repeated until only one component is left. The subset along
the path that gives the lowest leave-one-out nearest neighbour error E is selected.

In practice it may be that one is interested in dimensionality reduction to the extent 
that one is prepared to accept some decrease in approximation accuracy. In that 
situation it may be possible to identify a smaller than optimal subset which still
provides an acceptable error rate. 

This procedure gradually eliminates components in reverse order of significance 
and its time complexity is quadratic in the number of attributes and quadratic in the 
number of training instances. Our experiments show that it can be done in a 
computationally effective manner with data sets having hundreds of components and 
thousands of training instances. The ideal process of computing E in an exhaustive 
manner for all possible subsets of components is computationally prohibitive because 
of the size of the power set involved. 
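The backward procedure can be sketched as follows, assuming scalar outputs and a brute-force error computation. The names `loo_error` and `backward_selection`, the tie-breaking rule, and the toy data are ours. The sketch recomputes all distances from scratch at every step, so it does not attain the time complexity quoted above.

```python
def loo_error(X, Y, cols):
    """Leave-one-out nearest neighbour error restricted to the columns in cols."""
    total = 0.0
    for i in range(len(X)):
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: sum((X[i][c] - X[k][c]) ** 2 for c in cols))
        total += (Y[i] - Y[j]) ** 2
    return total

def backward_selection(X, Y):
    """Drop one component at a time, always the one whose removal leaves the
    lowest error, and return the best subset seen along the path."""
    current = list(range(len(X[0])))
    best_subset, best_err = list(current), loo_error(X, Y, current)
    while len(current) > 1:
        # Evaluate every subspace obtained by leaving out one remaining component.
        err, drop = min((loo_error(X, Y, [c for c in current if c != d]), d)
                        for d in current)
        current.remove(drop)
        # '<=' prefers the smaller subset when errors tie (a choice made here).
        if err <= best_err:
            best_subset, best_err = list(current), err
    return best_subset, best_err

X = [[0.0, 0.9], [0.1, 0.0], [0.5, 0.8], [0.6, 0.1]]
Y = [0.0, 0.0, 1.0, 1.0]          # depends on the first component only
selected, error = backward_selection(X, Y)
```

Because the loop runs down to a single component and keeps the best subset seen, a temporary rise in the error along the path does not stop the search.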



3.2 Forward Sequential Selection (FSS) 

This procedure starts with an empty subset. Each of the M components is tested to
find out which produces the lowest error when used in isolation. That component is
then added to the subset and each of the remaining M-1 components is tested, in
conjunction with the already selected component, to discover the pair that leads to the
lowest error. This process is repeated, adding one component at a time, until the 
approximation error starts to increase. The FSS procedure is more economical than 
BSS with data sets having a large number of variables and where the actual subset of 
significant components is relatively small. However, it is generally unknown in 
advance whether this would be the case with a given data set. Although FSS is 
usually more economical it is often argued that FSS might be less effective in 
discovering combinations of attributes, which BSS is more likely to preserve. 
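A matching sketch of the forward procedure is given below, again with scalar outputs, brute-force distances, and hypothetical names (`loo_error`, `forward_selection`). The stopping rule follows the description above: stop as soon as the best achievable error increases.

```python
def loo_error(X, Y, cols):
    """Leave-one-out nearest neighbour error restricted to the columns in cols."""
    total = 0.0
    for i in range(len(X)):
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: sum((X[i][c] - X[k][c]) ** 2 for c in cols))
        total += (Y[i] - Y[j]) ** 2
    return total

def forward_selection(X, Y):
    """Add the component that gives the lowest error; stop when the error rises."""
    selected, remaining = [], list(range(len(X[0])))
    best_err = float("inf")
    while remaining:
        err, add = min((loo_error(X, Y, selected + [c]), c) for c in remaining)
        if err > best_err:
            break  # the approximation error started to increase
        selected.append(add)
        remaining.remove(add)
        best_err = err
    return selected, best_err

X = [[0.0, 0.9], [0.1, 0.0], [0.5, 0.8], [0.6, 0.1]]
Y = [0.0, 0.0, 1.0, 1.0]          # depends on the first component only
selected, error = forward_selection(X, Y)
```

Here the first component alone gives zero error, and adding the second raises the error to 4, so the search stops after one step.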



4 Experimental Results 

To test the effectiveness of the procedure we experimented with many benchmark 
problems from the UCI repository of machine learning databases and with the 
Mackey-Glass time series prediction problem in a function approximation scenario. 
Often with real world problems, one encounters the problem of missing values and the
choice of a metric for nearest neighbour calculations. We have adopted several
strategies to counter these difficulties. In the case of missing values the distance
calculations are performed in the available subspace. This can lead to some
anomalies, but our experiments show that despite this the procedure is robust and
performs well. All ordered numeric variables (discrete or continuous) were
normalized to the range 0..1. The Euclidean distance calculations were then carried
out as usual. Symbolic variables are often encoded as discrete numeric, but have no
ordering associated with the actual values. With symbolic domains the distance
calculation was based on the Hamming distance (the number of differing symbols). In
problems where a mixture of numeric and symbolic variables was involved the
distance calculation was also mixed. With this we were able to deal with all of the
data sets in the UCI repository.
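The handling of mixed and missing values might look like the sketch below. The text does not say how the Euclidean and Hamming parts were combined, so adding them is an assumption made here, as are the names `normalise` and `mixed_distance` and the use of `None` to mark a missing value.

```python
import math

def normalise(column):
    """Scale the known values of a numeric column to 0..1 (None marks missing)."""
    known = [v for v in column if v is not None]
    lo, hi = min(known), max(known)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant column
    return [None if v is None else (v - lo) / span for v in column]

def mixed_distance(a, b, symbolic):
    """Euclidean distance on numeric components plus Hamming distance on symbolic
    ones (the sum is an assumption), skipping components missing in either vector."""
    numeric_sq, mismatches = 0.0, 0
    for k, (u, v) in enumerate(zip(a, b)):
        if u is None or v is None:
            continue  # compute the distance in the available subspace only
        if k in symbolic:
            mismatches += (u != v)
        else:
            numeric_sq += (u - v) ** 2
    return math.sqrt(numeric_sq) + mismatches

scaled = normalise([2, None, 6, 4])
distance = mixed_distance([0.0, "red", 0.5], [1.0, "blue", None], symbolic={1})
```

In the example the third component is missing in one vector and is therefore ignored, leaving one numeric difference and one symbol mismatch.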
4.1 Pattern Classification

We have experimented with 18 data sets from the UCI repository. These data sets 
cover many different types of problems having discrete, continuous, and symbolic 
variables. Some data sets have missing values, and some have a mixture of all the 
above. The 18 data sets are listed in Table 1. The rightmost column lists the size of 
the selected subset, averaged over 10 runs of BSS. 

Significant reduction in the number of attributes was obtained in all cases. The 
important question though is whether this reduction also leads to improved accuracy. 
We compare the classification accuracy of a nearest neighbour classifier, in
the selected subspace, with results obtained with C4.5 (Quinlan's implementation
See5). It should be noted that we used default training parameters to test C4.5. However,
the error rates we obtained (Table 2) agree with results previously reported in [11].
We have also used the boosting option with a committee of 10 classifiers. Boosted
C4.5 classifiers produced better results than C4.5 without boosting. C4.5 [12] is a
well-understood tree building procedure and it is used in numerous publications as a
yardstick for performance comparisons. All tests were conducted using 10-fold cross
validation (10-fold-XV). The data was partitioned into 10 disjoint subsets having
similar statistics with respect to class membership proportions. Using the training data 
set alone we then carried out subset selection. Only the variables that were selected as 
relevant were then used to test the training set as a Nearest Neighbour classifier (in 
the reduced subspace) against the held-out data. The results, averaged over all 10 
experiments are depicted in Figure 1. The performance of Nearest Neighbour 
classifier, in a suitably reduced subspace, compares well with boosted C4.5. 
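The text does not say how the class-balanced partitions were produced. One simple way, shown here purely as an assumption, is to deal the examples of each class out to the folds in round-robin fashion; the name `stratified_folds` and the toy labels are ours.

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign example indices to k disjoint folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    slot = 0
    for indices in by_class.values():
        for idx in indices:
            folds[slot % k].append(idx)  # deal out round-robin, class by class
            slot += 1
    return folds

labels = ["a"] * 20 + ["b"] * 10
folds = stratified_folds(labels, k=10)
# In each experiment one fold is held out; subset selection and the nearest
# neighbour classifier use only the other nine folds.
```

Each of the ten folds in this toy case receives two examples of class "a" and one of class "b".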

The error rates of both procedures usually fall within one standard deviation from 
each other. This is a remarkably consistent result that seems to have been overlooked 
in the past. It demonstrates that a nearest neighbour classifier can often perform as 
well as a sophisticated committee of decision trees - provided that a suitable subset 
selection procedure is applied beforehand. The results depicted here correspond to 
the BSS procedure, but these are similar to the results obtained with the FSS 
procedure. In none of the data sets did we observe BSS to discover a subset of 
attributes that was not also discovered by FSS. Therefore, the conjecture that BSS is 
superior in discovering combinations of attributes is not confirmed by our results. 

To appreciate the improvement in Nearest Neighbour classification due to subset 
selection we also tested the accuracy of a classifier that is based on the full training set 
(i.e., in the full input space available with no subset selection). Again, we used the 
same 10-fold cross validation partitions to classify each partition by the remaining 9 
partitions. In addition we also used LVQ [13] to obtain a Nearest Neighbour 
classifier having a single prototype vector from each class. This of course represents 
a very limited classifier whereby linear discriminant functions separate the classes. 
We refer to this classifier as a Linear Machine. We used the error rate of the linear 
machine as a base-line error rate (it is the simplest nearest neighbour classifier 
possible). The error rates of all the methods were normalized by the error rate of the 
Linear Machine and are depicted in Figure 2. 
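The normalization is a simple ratio. The sketch below applies it to the Sonar and Labor rows of Table 2; the dictionary layout and the function name `normalise_by_linear_machine` are ours.

```python
# Error rates (%) copied from the Sonar and Labor rows of Table 2.
rates = {
    "Sonar": {"NN full set": 13.9, "NN subset": 14.3, "C5 boost": 17.8,
              "C5": 22.0, "Linear machine": 22.5},
    "Labor": {"NN full set": 22.0, "NN subset": 16.0, "C5 boost": 18.3,
              "C5": 22.0, "Linear machine": 12.0},
}

def normalise_by_linear_machine(row):
    """Divide every error rate in a row by the linear machine's error rate."""
    base = row["Linear machine"]
    return {method: rate / base for method, rate in row.items()}

relative = {domain: normalise_by_linear_machine(row) for domain, row in rates.items()}
```

A relative error below 1 means the method beats the linear machine (all methods on Sonar), while a value above 1 means it does worse (all methods on Labor).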




Fig. 2. Normalized error rates (axis label: Relative Error Rate)
Table 1. Data sets used in subset selection for pattern classification

Domain                     Number of   Number of   Number of    Subset
                           cases       classes     attributes   size
Auto insurance                 205          6          25         7.3
Breast cancer (Wisc)           699          2           9         5.7
Horse colic                    368          2          22         9.0
Credit screening (Aus)         490          2          15         9.1
Pima diabetes                  768          2           8         4.9
Glass identification           214          6           9         5.7
Heart disease (Clev)           303          2          13         7.4
Heart disease (Hun)            294          2          13         2.7
Hepatitis prognosis            155          2          19         9.1
Hypothyroid diagnosis         3772          5          25         7.5
Iris classification            150          3           4         2.1
Labor negotiations              57          2          16         7.5
Image segmentation            2310          7          19         6.2
Sick euthyroid                3772          2          29         9.1
Sonar classification           208          2          60        12.2
Waveform differentiation       300          3          21        13.6
Letter identification        20000         26          16        11.0
Banding                        138          2          29        13.5



The complete results of 10-fold-XV tests are given in Table 2 where the best 
figures for each data set are shaded. It is evident that subset selection significantly 
improves the performance of a nearest neighbour classifier, which is otherwise 
significantly outperformed by a boosted C4.5 classifier. Perhaps more interesting is 
the fact that there are several data sets, on the right-hand side of Figure 2, for
which not only is there no advantage in using a non-linear classifier, but there is
actually a distinct disadvantage to doing so. It also shows that boosting was not
effective, in our experiments (using ensembles of 10 classifiers), in avoiding
over-fitting of training data.



4.2 Function Approximation

The Mackey-Glass benchmark problem has been widely used to test time series
prediction methods. Lapedes and Farber [7] have used a multilayer perceptron trained 
on a short sequence of 500 points to predict future values of the series. Farmer and 
Sidorowich [3] have used a form of local regression to tackle the same problem. 
Given a segment of the time series sequence their procedure finds a set of similar 
segments in a long stored sequence. Prediction from the current sequence into the 
future then follows by linear regression on those similar past segments and the way 
they evolved. 








Table 2. Pattern classification error rates (%)

Domain       NN full set   NN subset   C5 boost     C5    Linear machine
Sonar            13.9         14.3       17.8      22.0       22.5
Letter            4.9          3.9        6.4      13.0       43.9
Glass            29.5         24.3       28.9      30.0       37.6
Auto             16.0         15.0       16.0      17.6       35.5
Iris              4.7          4.7        5.3       5.3        6.7
Segment          13.0          3.2        1.9       2.9       15.8
Sick              3.8          1.8        1.0       1.3        6.1
Hypo              2.5          2.1        0.8       0.9        3.6
Credit-a         17.3         12.2       11.4      12.3       13.8
Breast-w          4.7          4.1        3.6       4.4        4.2
Banding          26.2         27.7       24.6      27.6       26.2
Colic            18.9         17.8       15.7      16.3       16.4
Heart-h          22.8         20.3       22.0      21.7       16.6
Waveform         25.7         25.0       23.0      25.7       20.0
Heart-c          24.3         22.1       19.8      25.1       17.3
Hepatitis        18.7         19.3       18.1      18.7       14.7
Labor            22.0         16.0       18.3      22.0       12.0
Diabetes         29.3         31.4       24.2      25.8       22.4




The Mackey-Glass differential delay equation is defined as:

$$\frac{dx}{dt} = -b\,x(t) + a\,\frac{x(t-\tau)}{1 + x(t-\tau)^{10}}$$

We used the values a = 0.2 and b = 0.1. As the value of $\tau$ is increased the series
exhibits a more chaotic behaviour. We have experimented with a value of $\tau = 30$. The
usual approach to the solution of this problem involves the prediction of $x(t+T)$
from a vector of $m+1$ components: $\{x(t), x(t-\delta), x(t-2\delta), \ldots, x(t-m\delta)\}$.
The values that are typically used for the prediction of the Mackey-Glass equation
with the choice $\tau = 30$ are $\delta = 6$ and $m = 5$.
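The text does not state how the series was generated. A minimal sketch, assuming simple Euler integration with a unit time step and a constant initial history (both assumptions, as is the function name `mackey_glass`), is:

```python
def mackey_glass(n, a=0.2, b=0.1, tau=30, x0=1.2, dt=1.0):
    """Euler integration of dx/dt = -b x(t) + a x(t-tau) / (1 + x(t-tau)^10)."""
    lag = int(round(tau / dt))
    x = [x0] * (lag + 1)  # constant history as the initial condition (assumption)
    for _ in range(n):
        delayed = x[-lag - 1]  # value of the series tau time units ago
        x.append(x[-1] + dt * (-b * x[-1] + a * delayed / (1.0 + delayed ** 10)))
    return x[lag + 1:]

series = mackey_glass(500)
```

A smaller step or a higher-order integrator would give a more accurate trajectory; the unit step is used here only to keep the sketch short.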

To demonstrate the utility of subset selection we tackled the problem without the 
knowledge of a suitable subset. Starting from a past sequence, we discover a suitable 
set of attributes to use in function approximation. We created a training sequence of 
500 50-dimensional vectors by sliding a window of size 50 along the given training 
sequence, with the value to predict being the next point, i.e. one step ahead. The input 
to the subset selection procedure was $\{x(t), x(t-1), x(t-2), \ldots, x(t-49)\}$ with
the function value sought being $x(t+1)$. We conducted many experiments with
statistically independent sequences of 500 points, using both BSS and FSS. The FSS
procedure proved superior on this data set, always converging on the subset $x(t)$ and
$x(t-30)$. The BSS results varied slightly between experiments. We consistently
obtained 5 or fewer attributes, which always included values from the sets
$\{x(t), x(t-1), x(t-2)\}$ and $\{x(t-29), x(t-30), x(t-31)\}$. It was usually,
but not always, the case that $x(t)$ and $x(t-30)$ were the most significant, as one
would expect given the form of the Mackey-Glass equation (with longer sequences
it is easy to obtain this result). A plot of the nearest neighbour error as subset selection 
progressed with BSS is depicted in Figure 3. It demonstrates how the prediction error 
decreases while irrelevant attributes are removed, and how a sharp rise is observed 
when significant attributes are removed. 
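The construction of the 50-dimensional training vectors can be sketched as follows. The helper name `sliding_windows` and the toy integer sequences are ours.

```python
def sliding_windows(series, width=50):
    """Build pairs ([x(t), x(t-1), ..., x(t-width+1)], x(t+1)) along the series."""
    inputs, targets = [], []
    for t in range(width - 1, len(series) - 1):
        inputs.append([series[t - k] for k in range(width)])
        targets.append(series[t + 1])
    return inputs, targets

inputs, targets = sliding_windows(list(range(60)), width=50)
```

With a window of 50 and a one-step-ahead target, a sequence of 550 points yields 500 training vectors.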

The False Nearest Neighbours (FNN) procedure is a related method that was
designed to obtain the optimum embedding dimension for phase space reconstruction
[14]. Unlike our procedure, FNN requires a pre-determined delay constant (spacing
between points), and it then discovers the number of delays that are required.
Therefore, FNN is a more restricted procedure in that it requires prior analysis to
determine the delay constant.

To test the result obtained with BSS, which did not always converge on the ideal 
subset, we intentionally selected the input sequence $x(t-2)$ and $x(t-31)$, which
appeared as the result in one of the subset selection experiments. This is not the set of
inputs that appear in the Mackey-Glass differential equation for $x(t+1)$. The
prediction accuracy for this subset was tested with 3 different methods of function 
approximation in predicting the time series from 1 to 400 steps ahead. 

A multilayer perceptron with two hidden layers (2:10:10:1) was trained to predict
one step ahead using a MATLAB toolbox implementation of the
Levenberg-Marquardt training algorithm (a solution was obtained in less than 1
minute on a PC). The neural network was tested in predicting the future evolution of
the series by using the iterated prediction method. Starting from an initial sequence, 
the next (future) value is predicted by the network. This predicted value is then 
appended to the initial sequence, which is then used to predict the next value in the 
sequence. This process is repeated to obtain 400 time steps ahead prediction. We 
conducted a set of 100 independent prediction experiments. Each experiment was 
performed with a statistically independent training sequence of 500 points and test 
sequence of 1000 points. Figure 4 depicts the results obtained showing the mean 
error rates over 100 independent experiments. The Error Index is plotted against the 
prediction time step. The Error Index is the root mean squared error, divided by the 
standard deviation of the sequence. 
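Iterated prediction and the Error Index can be sketched independently of the predictor used. In the sketch below `predict` stands for any one-step-ahead model (the trained network in the text); the function names and the toy predictor are assumptions. The Error Index is computed here over a single sequence, whereas the text averages errors over 100 experiments at each prediction step.

```python
import math

def iterate_prediction(history, predict, lags, steps):
    """Predict `steps` values ahead, feeding each prediction back into the sequence."""
    seq = list(history)
    for _ in range(steps):
        inputs = [seq[-1 - lag] for lag in lags]  # lags (2, 31) give x(t-2), x(t-31)
        seq.append(predict(inputs))
    return seq[len(history):]

def error_index(predicted, actual):
    """Root mean squared error divided by the standard deviation of `actual`."""
    n = len(actual)
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
    mean = sum(actual) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in actual) / n)
    return rmse / std

# Toy predictor: the next value is the sum of the two most recent values.
forecast = iterate_prediction([1, 2, 3, 4], lambda v: v[0] + v[1], lags=(0, 1), steps=3)
```

An Error Index of 1 corresponds to an error as large as the spread of the series itself, which is what one would get by always predicting the mean.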
Fig. 3. Backward Sequential Selection for Mackey-Glass time series

The results obtained with the neural network are very similar to the results obtained
by Lapedes and Farber using the same network architecture [11]. However, our results
were obtained with the input sequence $\{x(t-2), x(t-31)\}$ obtained empirically
by BSS. The results of Lapedes and Farber were obtained with the extended sequence
of input vector components identified by analysis of the properties of the
Mackey-Glass chaotic time series and using Takens' Theorem:
$\{x(t), x(t-6), x(t-12), x(t-18), x(t-24), x(t-30)\}$. That sequence
contains the actual variables that were used to generate the series.

Also depicted in Figure 4 is the result of iterated nearest-neighbour approximation. 
In this approach we simply predict the next value in the sequence by using the value 
observed to follow the nearest neighbour vector in the training sequence. Iterated 
prediction is then used to predict multiple time steps ahead. Although less accurate 
than the neural network - as one would expect from a piecewise constant 
approximation of a continuous function - the approximation is still rather useful in 
selecting relevant variables. 
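Iterated nearest-neighbour approximation can be illustrated on a toy periodic series. The function name `nn_predictor`, the toy series, and the two-component input vector are assumptions made for this sketch.

```python
def nn_predictor(train_inputs, train_targets):
    """Return a function predicting the value that followed the nearest training vector."""
    def predict(query):
        best = min(range(len(train_inputs)),
                   key=lambda j: sum((q - v) ** 2
                                     for q, v in zip(query, train_inputs[j])))
        return train_targets[best]
    return predict

series = [0, 1, 2, 3] * 5
# Training pairs: input [x(t), x(t-1)], target x(t+1).
inputs = [[series[t], series[t - 1]] for t in range(1, len(series) - 1)]
targets = [series[t + 1] for t in range(1, len(series) - 1)]
predict = nn_predictor(inputs, targets)

seq = [2, 3]  # x(t-1) = 2, x(t) = 3
for _ in range(4):
    seq.append(predict([seq[-1], seq[-2]]))  # feed each prediction back in
forecast = seq[2:]
```

Because the prediction is always a value copied from the training sequence, the approximation is piecewise constant, as noted above.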

The third approach that we used is the regression approach of Farmer and
Sidorowich [3]. Like the Nearest Neighbour approach it involves identification of a
subset of past sequences that are most similar to the current sequence. The future
value is then computed by the application of linear regression using the past
sequences and their future values. Farmer and Sidorowich used much longer
sequences than were used by Lapedes and Farber, and they did not use the iterated
prediction method. To obtain a more meaningful comparison we used the same short
training sequences of only 500 points in implementing the regression method. The
relatively large number of points presumably required for the regression method to
work was one criticism leveled at this approach, and we wanted to test the method
with the same length sequence as the neural network. A search tree was used to locate
nearest neighbour sequences so that searching for past sequences required very few
distance calculations at tree nodes. To predict up to 400 steps ahead we again used
the iterated prediction method. The results are depicted in Figure 4. The regression
method is only slightly less effective than the neural network. Lapedes and Farber
indicated that both methods seem to perform similarly and our experiments confirm
this by direct comparison. However, by keeping all conditions equal, our results
demonstrate that the regression method does not require larger training sets and
longer sequences than the neural network does in order to perform as accurately,
provided that a suitable subset of variables is used.

Fig. 4. The Error Index averaged over 100 independent segment predictions (panel
title: Comparison of prediction methods; horizontal axis: prediction time steps)
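The local regression idea can be sketched as follows: find the k nearest past vectors and fit a linear model to them by least squares. This is an illustration under our own assumptions, not the implementation used in the experiments. It locates neighbours by brute force rather than with a search tree, solves the normal equations directly, and assumes the chosen neighbours are not degenerate. The names `solve` and `local_linear_predict` are ours.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def local_linear_predict(query, inputs, targets, k):
    """Fit y = w0 + w.x by least squares on the k nearest neighbours of query."""
    order = sorted(range(len(inputs)),
                   key=lambda j: sum((q - v) ** 2
                                     for q, v in zip(query, inputs[j])))[:k]
    rows = [[1.0] + list(inputs[j]) for j in order]
    ys = [targets[j] for j in order]
    dim = len(rows[0])
    # Normal equations (R^T R) w = R^T y.
    A = [[sum(r[a] * r[b] for r in rows) for b in range(dim)] for a in range(dim)]
    rhs = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(dim)]
    w = solve(A, rhs)
    return w[0] + sum(wi * qi for wi, qi in zip(w[1:], query))

# Toy data generated from y = 1 + 2*x1 - x2.
inputs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
targets = [1.0 + 2.0 * x1 - x2 for x1, x2 in inputs]
prediction = local_linear_predict([0.5, 0.5], inputs, targets, k=4)
```

For multi-step forecasts the one-step prediction would be fed back into the sequence, as with the other two methods.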
5 Conclusion

This paper describes a subset selection procedure that is based on Nearest Neighbour 
analysis of the training set. We demonstrate the utility of the procedure in pattern 
classification and in function approximation problems. The procedure was tested on 
numerous data sets, small and large, from the UCI repository of machine learning data 
sets. Our results demonstrate that in almost every case it was possible to reduce the 
dimensionality of the input space. Furthermore, our results show that after subset 
selection a Nearest Neighbour classifier can often perform as well as or better than a
state of the art method such as a boosted C4.5 classifier. 
Abstract. The selection of an appropriate inducer is crucial for performing
effective classification. In previous work we presented a system
called NOEMON which relied on a mapping between dataset characteristics
and inducer performance to propose inducers for specific datasets.
Instance-based learning was applied to meta-learning problems, each one
associated with a specific pair of inducers. The generated models were
used to provide a ranking of inducers on new datasets.

Instance-based learning assumes that all the attributes have the same im- 
portance. We discovered that the best set of discriminating attributes is 
different for every pair of inducers. We applied a feature selection method 
on the meta-learning problems, to get the best set of attributes for each 
problem. The performance of the system is significantly improved. 



1 Introduction 

One of the most difficult problems in the machine learning field is that of the 
appropriate selection of a classification algorithm for a specific classification task. 
As it is known from the various NFL theorems there is no classification 
algorithm that is superior over all the others for all the possible classification 
problems. 

In the past the problem was tackled mainly through the use of measures 
that describe properties of the dataset on which the classification task is to be 
performed. These dataset measures, along with performance measures of the 
classification algorithms, are used as data to (meta-)learn models that guide the 
application of the classification algorithms. 

There has been a substantial amount of research effort that follows the pre- 
vious mentioned approach, using different sets of dataset characteristics, and 
different ways of constructing the meta-level problems and the meta-models. 
Nevertheless little attention has been given to the usefulness and the discrimi- 
nating power of the dataset characteristics used, with the only notable exception 
of p2b| . In this paper we examine the meta-feature selection problem in the
context of pairwise comparisons between base-level learning algorithms. We believe
this setup gives us a better insight into the usefulness of each characteristic. Our 
goal is twofold: first to improve the overall performance, and second to acquire 
a better understanding of the way the dataset characteristics affect the relative 
performance of the algorithms. 
2 Background

There is a huge amount of literature -both theoretical and empirical- on model / 
algorithm selection for a specific classification task. We restrict our attention to 
empirical work, and more specifically to approaches that handle the problem of 
model selection as a learning problem on the meta-level (meta-learning) . Here the 
goal of meta-learning is the construction of meta-models that associate dataset 
descriptions with algorithm performance. 

The idea of meta-learning appears for the first time, in a very simple form, 
in the work of Rendell et al, Meta-learning regains attention as a byproduct 
of the STATLOG project ^7|. STATLOG’s main goal was the study of the 
performance of a variety of classification algorithms on different problems and the 
acquisition of knowledge on what works well and where. To this end, they tried 
to explain the performance of the algorithms with respect to dataset properties. 
These properties included mainly statistical and information based measures 
taken on the datasets. They went one step further and automated the process 
by constructing meta-learning problems, one for each classification algorithm. 
Algorithms were characterized as applicable or non-applicable for each dataset, 
and it was that characterization that they tried to predict and explain using the 
dataset properties, (see 0, for a thorough description). 

The approach followed in STATLOG had two main limitations. First, a
dataset's properties were mainly means of statistical and information-theoretical
measures computed over all its attributes, due to the restrictions imposed
by propositional learning. Second, for a given dataset, algorithms were characterized
only as applicable or non-applicable, i.e. the approach did not provide a way to
rank the algorithms; furthermore, that characterization was based on a simple
comparison of accuracies devoid of any statistical significance test.

Kalousis and Theoharis in m tried to overcome those limitations. To al- 
leviate the problem of the constraints imposed by propositional learning they 
introduced histograms to describe the distributions of the measures computed 
for each attribute. To provide a ranking of the algorithms and to statistically 
control it, they created $\binom{n}{2}$ meta-learning problems, n being the number of
algorithms. Each meta-learning problem concerns a specific pair of algorithms.
For every dataset in order to determine which algorithm from a pair was bet- 
ter, a statistical test was employed. This formulation of meta-learning problems 
provides great flexibility. It allows the analyst to focus on specific pairs of al- 
gorithms and handle each pair in a different way, e.g. by applying a different 
learning algorithm on the meta-level, or even by using a different set of discrim- 
inating features. In short the power of the approach comes from the possibility 
of combining different meta-learning models. 
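The pairwise formulation can be sketched with a small example. The text does not say how the pairwise outcomes are turned into a ranking, so counting wins is an assumption made here, as are the function name `rank_by_wins`, the algorithm labels, and the stand-in `winner_of` function that plays the role of the per-pair meta-models.

```python
from collections import Counter
from itertools import combinations

def rank_by_wins(algorithms, winner_of):
    """Rank algorithms by the number of pairwise comparisons they win.
    winner_of(pair) returns the predicted winner, or None for no
    significant difference."""
    wins = Counter({a: 0 for a in algorithms})
    for pair in combinations(algorithms, 2):  # one meta-learning problem per pair
        winner = winner_of(pair)
        if winner is not None:
            wins[winner] += 1
    return sorted(algorithms, key=lambda a: -wins[a])

algorithms = ["c45", "knn", "nb"]            # hypothetical labels
pairs = list(combinations(algorithms, 2))    # n(n-1)/2 pairs
ranking = rank_by_wins(algorithms, lambda pair: min(pair))  # toy stand-in rule
```

With three algorithms there are three pairwise problems, and each one could use its own meta-level learner or its own set of discriminating features.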

A very elegant way to make full use of the information contained in the
statistical and information theoretical measures is inductive logic programming,
which overcomes the representational limitations of propositional learning [22].

Soares and Brazdil provide a method for ranking, not only in terms of
accuracy, but in terms of the accuracy-time tradeoff. The ranking is produced
from a ratio of accuracy over time estimated from the performance of the
classifiers averaged on a number of datasets. Their approach is similar to the
DEA approach (Data Envelopment Analysis), which also takes into account various
performance measures of the algorithms. Both approaches are very effective
rankers, but when they are used in a meta-learning context their flexibility
diminishes. To produce a ranking for a new dataset, a set of similar datasets
has to be examined (similar in terms of the dataset properties). This imposes
the use of a single and global meta-learning model, which may only be based on
a k-nearest neighbor algorithm. Furthermore, as has been shown in a previous
study, a nearest neighbor algorithm is not necessarily the best choice as the
inducer of the meta-level.

A new approach to dataset characterization appeared in Pfahringer et al.
Instead of using statistical and information based measures to describe the
datasets, they use the performance of very simple learning algorithms, which
they call landmarkers. A crucial factor here is the appropriate selection of the
landmarkers, which should cover a wide range of learning biases, and at least be
representative of the learning biases of the base learners. The results presented
in that paper seem quite promising.

3 Meta-feature Selection 

A dimension that has received little attention, if any, in the meta-learning
field is the explanation and understanding of the factors that affect inducer
performance. All previous efforts have aimed at maximizing the predictive
capabilities of the meta-learner without understanding the factors (i.e.
properties of the datasets) that affect the performance of the algorithms.
Applying feature selection to the meta-level can cover this gap and at the same
time improve meta-learning performance. Using feature selection we can have a
better idea of the factors that affect the performance of the learners. This is
especially true when the meta-learning algorithm used is an instance based
learner, which gives no insight into the relevance of the attributes used for
learning.

The first attempt at meta-feature selection appeared in the meta-learning
framework of zooming-ranking. As mentioned previously, the main limitation of
this framework is the mandatory use of a single and global meta-learning model
(an instance based model). However, the factors that determine the relative
performance of one group of algorithms may be quite different from the factors
that determine the relative performance of another group of algorithms. It is
exactly this case that calls for the use and combination of different
meta-learning models. Here the diversity of the meta-models comes from the use
of different sets of meta-features (i.e. dataset characteristics). This is
where we can make full use of the flexibility of the pairwise meta-learning
framework of Kalousis and Theoharis. By applying feature selection to pairwise
comparisons of learning algorithms we get different sets of meta-features,
which give rise to different meta-learning problems and finally to different
meta-learned models.

4 Feature Selection 

There are two main ways in which feature selection is performed in machine
learning: the filter and the wrapper approach. In the filter approach, solely
properties of the datasets are used in order to perform the selection of the
features. These properties could be measures of association between features,
or measures of distance or dependence.

In the wrapper approach, the driving force is the accuracy of the learning
algorithm that is going to be applied on the dataset. An extensive and systematic
search is performed in the state space of all the possible feature subsets using
heuristic search methods, like hill climbing, simulated annealing or best first.
The search can begin either from the full set of features (backward elimination)
or from the empty set (forward selection). The feature selection algorithm
conducts the search using the estimated accuracy of the induction algorithm as
the evaluation function. At the end, the feature set achieving the highest
accuracy is selected.
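The wrapper idea can be illustrated with a small Python sketch. It uses plain greedy hill climbing with backward elimination rather than best-first search, and the `evaluate` callable is an assumption standing in for the cross-validated accuracy of the induction algorithm; `toy_accuracy` and the feature names are invented purely to make the example runnable.

```python
def backward_elimination(features, evaluate):
    """Greedy wrapper search starting from the full feature set.

    `evaluate` maps a frozenset of features to an estimated accuracy.
    One feature is dropped per step as long as the best reduced subset
    is at least as accurate as the current one.
    """
    current = frozenset(features)
    best_score = evaluate(current)
    while len(current) > 1:
        # Score every subset obtained by removing a single feature.
        scored = [(evaluate(current - {f}), f) for f in sorted(current)]
        score, dropped = max(scored)
        if score < best_score:
            break  # every removal hurts accuracy, so stop searching
        current, best_score = current - {dropped}, score
    return current, best_score


def toy_accuracy(subset):
    """Hypothetical evaluator: two relevant features, the rest add noise."""
    relevant = {"noise_signal_ratio", "n_classes"}
    return 0.5 + 0.2 * len(relevant & subset) - 0.01 * len(subset - relevant)


if __name__ == "__main__":
    all_features = ["n_classes", "noise_signal_ratio", "missing_hist", "fract"]
    print(backward_elimination(all_features, toy_accuracy))
```

In a real wrapper the evaluator would retrain and cross-validate the learner for every candidate subset, which is where the computational cost comes from.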

In this study we have chosen to use the wrapper approach to perform the
feature selection on the meta-level. Although it requires a substantial amount of
computational time, in the case of meta-learning this factor is not so important,
since it will only be performed once.



5 Experimental Setup and Results 

We used a variety of learning algorithms as the base learners whose relative
performance we try to predict: an orthogonal decision tree inducer from
Quinlan’s C5.0 (c50tree), an oblique decision tree inducer (Ltree), two rule
inducers, namely Ripper and the rule version of C5.0 (c50rules), a linear
discriminant (lindiscr), a boosting algorithm from C5.0 (c50boost), an
instance-based learner (IBL), and Naive Bayes (NB) (the last two from the MLC++
library). To perform feature selection we used the MLC++ feature selection
capabilities. We started the search from the full set of characteristics, thus
using backward elimination. The search strategy used was best first search. The
evaluation function for the quality of a state in the search space (i.e. a
subset of features) was the accuracy of the instance-based inducer as estimated
by 10-fold cross validation.

We used 1082 datasets, including benchmarks from the UCI repository
as well as artificial datasets generated by modifying the former. The modified
datasets were produced in the context of two large-scale studies designed to
explore the behavior of the learning methods in response to two additional
dataset deficiencies, namely missing values and irrelevant attributes. A
complete description of the way those datasets were created is given in our
earlier work.

Since one of the main goals is to predict which inducer(s) to use, we measure 
performance in terms of predictive accuracy. Accuracy is estimated not only for 
the final suggestions, but also for the individual meta-models extracted from 
the meta-learning problems, since these critically affect the performance of the 
whole system. We give results both for meta-models that have been created with 
feature selection and for meta-models that have been created with no feature 
selection, and compare the two approaches.

5.1 Dataset Characteristics

By dataset characteristics we mean a set of structural characteristics of a dataset
that jointly determine the performance of an inducer when applied to it. They
constitute the attributes of the meta-learning problems. The full set of
characteristics is presented in Table 1. Characteristics 8 to 11 give the
maximum, minimum, mean and standard deviation of the number of distinct values
of the nominal attributes. The concentration coefficient is a measure of
association between nominal attributes. The non computable features are
associated with pathological cases where for various reasons the corresponding
measure cannot be computed. Binary attributes is the number of binary
attributes obtained when all nominal attributes are represented with binary
encoding. Canonical correlation is the first canonical correlation between a
linear combination of the class variable and a linear combination of the
attributes. Fract is the proportion of total variation that is explained by the
first canonical discriminant. Equivalent number of attributes is the number of
attributes required to describe the class, assuming that they all have the same
mutual information with the class, equal to the mean mutual information. Noise
to signal ratio is a rough indication of the noise contained in the dataset.
Mean multiple correlation denotes the average correlation between each
attribute and a linear combination of all the other attributes. Finally,
SDratio is a measure of the homogeneity of the covariance matrices of the
different classes. A complete description of the characteristics can be found
in the earlier work on dataset characterization cited above.
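To make the information theoretical characteristics concrete, here is a minimal Python sketch for nominal data. The formulas for the equivalent number of attributes (class entropy divided by mean mutual information) and the noise to signal ratio (mean attribute entropy minus mean mutual information, divided by mean mutual information) follow the usual STATLOG-style definitions; treating them as the exact definitions used in this study is an assumption, as are all function names.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of nominal values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned nominal sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def info_characteristics(columns, labels):
    """Information theoretical meta-features of one dataset.

    `columns` is a list of attribute columns, `labels` the class column.
    Ratios are None when the mean mutual information is zero, which is
    one of the non computable cases.
    """
    class_entropy = entropy(labels)
    mean_attr_entropy = sum(entropy(c) for c in columns) / len(columns)
    mean_mi = sum(mutual_information(c, labels) for c in columns) / len(columns)
    computable = mean_mi > 0
    return {
        "class_entropy": class_entropy,
        "mean_attribute_entropy": mean_attr_entropy,
        "mean_mutual_information": mean_mi,
        "equivalent_number_of_attributes":
            class_entropy / mean_mi if computable else None,
        "noise_signal_ratio":
            (mean_attr_entropy - mean_mi) / mean_mi if computable else None,
    }


if __name__ == "__main__":
    cols = [[0, 0, 1, 1], [0, 1, 0, 1]]  # first attribute copies the class
    print(info_characteristics(cols, [0, 0, 1, 1]))
```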



Table 1. Dataset Characteristics.

No.      Characteristic                                     No.      Characteristic
1        # classes                                          45       non comput. correl. hist.
2        # attributes                                       46..55   missing values histogram
3        # instances                                        56,57    # continuous / # attributes, # nominal / # attributes
4        # attributes / # instances                         58       Binary Attributes
5        # unknown values                                   59       Fract
6        # unknown values / (# attributes * # instances)    60       Canonical Correlation
7        # nominal attributes                               61       Mean Skew
8..11    max, min, mean, stdv of nominal attribute values   62       Mean Kurtosis
12..21   concentration histogram                            63       Class Entropy
22       non computable conc. histogram                     64       Mean Attribute Entropy
23..32   concentration histogram with class                 65       Mean Mutual Information
33       non comput. conc. hist. with class                 66       Equivalent number of attributes
34       # continuous attributes                            67       Noise to signal ratio
35..44   correlation histogram                              68       Multi Attribute Correlation
                                                            69       SDratio

Table 2. Class Distributions for each of the meta-learning problems.



(algo-x, algo-y) pair     algo-x     algo-y     tie        majority
c50rules c50boost          4.47%     37.40%     58.14%     58.14%
c50tree  c50boost          3.91%     38.33%     57.77%     57.77%
c50tree  c50rules         13.02%     13.95%     73.02%     73.02%
lindiscr c50boost          3.63%     64.47%     31.91%     64.47%
lindiscr c50rules         12.09%     52.65%     35.26%     52.65%
lindiscr c50tree          10.98%     54.70%     34.33%     54.70%
ltree    c50boost         13.21%     35.16%     51.63%     51.63%
ltree    c50rules         27.07%     14.23%     58.70%     58.70%
ltree    c50tree          25.21%     15.26%     59.53%     59.53%
ltree    lindiscr         61.12%      6.51%     32.37%     61.12%
IBL      c50boost          1.02%     64.65%     34.33%     64.65%
IBL      c50rules         10.79%     49.77%     39.44%     49.77%
IBL      c50tree           7.91%     52.00%     40.09%     52.00%
IBL      lindiscr         42.33%     21.21%     36.47%     42.33%
IBL      ltree             9.30%     56.65%     34.05%     56.65%
NB       c50boost          2.70%     60.84%     36.47%     60.84%
NB       c50rules         14.33%     51.44%     34.23%     51.44%
NB       c50tree          12.09%     52.93%     34.98%     52.93%
NB       lindiscr         37.30%     21.58%     41.12%     41.12%
NB       ltree             5.58%     58.23%     36.19%     58.23%
NB       IBL              29.12%     34.79%     36.09%     36.09%
ripper   c50boost          1.58%     50.98%     47.44%     50.98%
ripper   c50rules          8.09%     32.00%     59.91%     59.91%
ripper   c50tree           2.70%     37.21%     60.09%     60.09%
ripper   lindiscr         46.14%     17.02%     36.84%     46.14%
ripper   ltree             3.44%     43.35%     53.21%     53.21%
ripper   IBL              34.42%     19.44%     46.14%     46.14%
ripper   NB               41.30%     16.93%     41.77%     41.77%
mean                                                       54.14%



5.2 Results on the Meta-learning Problems

From the 8 available learning algorithms a total of $\binom{8}{2} = 28$
meta-learning problems were created. Table 2 gives the distribution of classes
(algo-x, algo-y, tie) for each of these problems. The last column specifies the
percentage of the majority class; this will serve as the baseline or default
accuracy against which to evaluate the accuracies estimated by the learned
meta-models. If their performance is deemed acceptable, these models can then
be used to provide a ranking of inducers for new datasets.

We used 10-fold cross-validation to measure the accuracy of the
instance-based models generated for each of the meta-learning problems. In both
cases (i.e. with and without feature selection) we get a significant increase
over the mean default accuracy of 54.14%. The improvement is 27.25 percentage
points without and 30.40 percentage points with feature selection (Table 3). We
compare the accuracy achieved with and without feature selection, for each
meta-learning problem, using a McNemar test of significance. The results are
given in Table 3. As we can see, performance improves by an average of 3.15
percentage points when feature selection is performed. The difference in
accuracy is statistically significant for all pairs of inducers except (ripper,
c50boost), for which the accuracy is unchanged.
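For readers unfamiliar with the McNemar test, the following Python sketch computes an exact two-sided p-value from the two discordant counts of a paired comparison. The paper does not say which variant of the test was used, so the exact binomial form shown here is an assumption, and the function name is illustrative.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: cases classified correctly only by the first model,
    c: cases classified correctly only by the second model.
    Under the null hypothesis the discordant cases split 50/50,
    so the smaller count follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(mcnemar_exact(0, 10))  # strongly one-sided disagreement
    print(mcnemar_exact(5, 5))   # balanced disagreement
```

Only the cases on which the two models disagree enter the test, which is why two models with identical accuracy can yield a p-value of 1.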

Table 6 shows which characteristics were selected for some of the
meta-learning problems. It is clear that the set of discriminating
characteristics changes for different pairs of algorithms. Examining the
selected characteristics for each pair of inducers, we can only draw
conclusions as to which factors impact their relative performance. However, we
cannot explain how these factors determine that relationship, i.e. what are the
values of the characteristics for which it is better to use one inducer instead
of another. To get a quantitative description of how the dataset
characteristics determine inducers’ superiority we plan to use a different
meta-learning algorithm. Possible selections are inducers that produce an
explicit model of the classification, e.g. decision trees or rule inducers.

Another way to examine the results is to explore the “total” discriminating
power of each dataset characteristic, that is, how often it gets selected over
all the meta-learning problems. Table 6 shows the relative frequency with which
each feature is selected. The most often selected attribute is the noise to
signal ratio, present in 25 of the 28 meta-learning problems. Although a rough
approximation, since it is based on the mean attribute entropy and the mean
mutual information between the attributes and the class, it is quite useful in
determining the relative performance of the algorithms. The correlation
histogram is another noteworthy characteristic: one of its bins shares the
noise to signal ratio’s extremely high selection rate, and eight of the others
are at or above the 50% level. This seems to indicate the non negligible
influence of correlated attributes on learning, due to varying degrees of
sensitivity exhibited by the learners. Next comes the ratio of the number of
attributes to the number of instances. It is known that increasing the number
of features beyond a certain point is likely to be counterproductive. The
number of classes is also selected very often, providing an indication that
inducers react differently to variations in the number of classes. Information
theoretical measures such as mean mutual information and mean attribute entropy
also appear to be discriminating features, considering their high selection
rate. The only characteristic that seems to be completely useless is the
histogram of missing values: none of its elements is ever selected. This
characteristic describes the distribution of percentages of missing values of
the attributes. Overall, the discriminating power of each characteristic is
quite different, and with the notable exception of the missing values
histogram, all of them are used in at least one pair of base inducers.
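The selection frequency discussed above is simply the share of meta-learning problems in which a characteristic survives feature selection. A small Python sketch, assuming each problem is represented by a 0/1 selection mask over the same ordered list of characteristics (the function name is illustrative):

```python
def selection_frequency(masks):
    """Percentage of meta-learning problems that selected each feature.

    `masks` holds one list of 0/1 flags per meta-learning problem, all
    of the same length and in the same feature order.
    """
    n_problems = len(masks)
    return [100.0 * sum(column) / n_problems for column in zip(*masks)]


if __name__ == "__main__":
    # First three rows of Table 6 (# classes, # attributes, # instances)
    # for the pairs (IBL, NB), (IBL, Ltree) and (NB, c50boost).
    masks = [[1, 1, 1], [1, 0, 0], [1, 0, 1]]
    print(selection_frequency(masks))
    # A feature selected in 25 of 28 problems, as the noise to signal ratio:
    print(100.0 * 25 / 28)
```

With 28 problems every frequency is a multiple of 100/28, roughly 3.57 points, so 25 of 28 corresponds to about 89.3%.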



5.3 Results on the Final Suggestions 

Table 3. Accuracies and significance levels for each of the 28 meta-learning problems.

pair                  no feature selection   feature selection   significance level
c50rules c50boost     82.42%                 84.65%              0.01640
c50tree  c50boost     80.00%                 82.79%              0.00011
c50tree  c50rules     78.14%                 84.47%              0.00000
lindiscr c50boost     84.84%                 88.28%              0.00002
lindiscr c50rules     85.12%                 88.00%              0.00030
lindiscr c50tree      85.58%                 88.84%              0.00013
ltree    c50boost     78.60%                 80.47%              0.00721
ltree    c50rules     82.14%                 83.91%              0.00395
ltree    c50tree      78.79%                 81.86%              0.00044
ltree    lindiscr     81.21%                 84.65%              0.00003
IBL      c50boost     85.40%                 87.91%              0.00305
IBL      c50rules     74.88%                 80.56%              0.00000
IBL      c50tree      79.53%                 82.14%              0.00125
IBL      lindiscr     81.49%                 84.84%              0.00000
IBL      ltree        74.60%                 80.47%              0.00000
NB       c50boost     85.02%                 87.44%              0.00001
NB       c50rules     84.74%                 88.19%              0.00034
NB       c50tree      84.47%                 87.91%              0.00008
NB       lindiscr     77.95%                 81.58%              0.00013
NB       ltree        84.37%                 86.60%              0.00028
NB       IBL          85.02%                 86.42%              0.02497
ripper   c50boost     85.02%                 85.02%              1.00000
ripper   c50rules     77.86%                 83.81%              0.00000
ripper   c50tree      84.09%                 86.60%              0.00175
ripper   lindiscr     78.70%                 81.40%              0.00034
ripper   ltree        75.81%                 78.70%              0.00016
ripper   IBL          81.30%                 85.02%              0.00003
ripper   NB           81.67%                 84.56%              0.00007
means                 81.39%                 84.54%
improvement           27.25%                 30.40%

To measure the performance in terms of the final suggestion, we first derive
the true ranking of the inducers for each dataset, using the McNemar test. This
gives us, at least theoretically, $2^8 - 1 = 255$ possible combinations of
top-ranked inducers. In practice, only 80 of those appear in our study. The
inducer that gets the top position most often is c50boost (26.23%). In Table 4
we give the distribution of the inducer(s) that get the top ranking in more
than 1% of the datasets. The next step is to perform 10-fold cross validation
on the 1082 datasets that were used. In each step of the cross validation
process we build all the meta-models using 90% of the datasets. The meta-models
are then applied to the remaining 10% and the proposed inducer(s) is (are)
compared to those which were actually ranked first on the basis of McNemar
tests. To estimate the accuracy we do not use a 0/1 loss function. Instead, a
suggestion is considered successful when the proposed inducer(s) is (are) a
subset of the true set of top inducers as determined by the McNemar tests.
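The success criterion for a final suggestion can be written down directly. The Python sketch below is illustrative only: the function names are hypothetical, and treating an empty suggestion as a failure is an assumption not stated in the text.

```python
def suggestion_is_successful(proposed, true_top):
    """A suggestion succeeds when every proposed inducer is truly top-ranked."""
    proposed, true_top = set(proposed), set(true_top)
    # Assumption: an empty suggestion does not count as a success.
    return bool(proposed) and proposed <= true_top


def suggestion_accuracy(results):
    """Fraction of successful suggestions over (proposed, true_top) pairs."""
    hits = sum(suggestion_is_successful(p, t) for p, t in results)
    return hits / len(results)


if __name__ == "__main__":
    results = [
        (["c50boost"], ["c50boost", "ltree"]),       # subset of the top set
        (["c50boost", "NB"], ["c50boost", "ltree"]),  # NB is not top-ranked
    ]
    print(suggestion_accuracy(results))
    # Number of non-empty subsets of 8 inducers:
    print(2 ** 8 - 1)
```

This criterion is more lenient than exact set equality, since proposing a single inducer out of several tied top inducers still counts as correct.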

Table 4. Groups of inducers that were ranked at the top for more than 1% of the
datasets.

Group                                                      Frequency   Percent
c50boost                                                   282         26.23
ltree                                                      140         13.02
c50rules c50tree ltree                                     54          5.02
c50rules c50boost c50tree                                  47          4.37
c50rules c50boost c50tree lindiscr ltree IBL NB ripper     47          4.37
lindiscr                                                   41          3.81
NB                                                         30          2.79
c50rules c50boost c50tree ltree ripper                     29          2.7
c50boost ltree                                             29          2.7
c50rules c50boost c50tree ltree IBL ripper                 27          2.51
IBL                                                        27          2.51
c50rules c50boost c50tree ltree                            25          2.33
c50rules c50boost                                          23          2.14
c50rules c50tree                                           22          2.05
c50rules                                                   20          1.86
ltree NB                                                   14          1.3
c50boost IBL                                               14          1.3
c50boost c50tree                                           13          1.21
lindiscr ltree                                             12          1.12
c50tree                                                    11          1.02
c50rules c50boost c50tree lindiscr ltree NB ripper         11          1.02

In Table 5 we see the accuracy that the instance-based inducer achieves on
the final suggestion of base level inducers. Accuracy without feature selection
is 69.67% and increases to 76.28% with feature selection. The p-value of a
binomial test of significance is 0, which means the difference is significant
at any level of significance. To conclude, feature selection significantly
improves performance with respect to the final suggestion of base inducers.

Table 5. Accuracy of the final suggestion with and without feature selection.

no feature selection   feature selection
69.67%                 76.28%



6 Conclusions and Future Work 

Here we have continued our work on a previous system for inducer selection.
The goal was twofold: first, to gain a better understanding of the factors that
determine the relative performance of inducers; second, to improve predictive
performance. The meta-learning framework adopted provides great flexibility and
the possibility to examine the factors that determine the relative performance
of each pair of inducers.

Examining the meta-models created for each pair of inducers, we saw that
the factors determining the relative performance of inducers vary from pair to
pair. The new subset of features for each pair of inducers not only improves the
performance but also provides a better understanding of what is relevant and
what is not.

Table 6. Characteristics selected for three of the meta-learning problems; 1 indicates
selection of the corresponding characteristic, 0 elimination. The frequency column gives
the frequency with which the characteristics appear in the different meta-learning
models.

Attribute                                          IBL NB       IBL Ltree    NB c50boost   frequency %
# classes                                          1            1            1             82.21
# attributes                                       1            0            0             60.71
# instances                                        1            0            1             46.42
# attributes / # instances                         1            1            0             85.71
# unknown values                                   1            0            0             25.00
# unknown values / (# attributes * # instances)    1            1            1             60.71
# nominal attributes                               1            1            1             71.42
max, min, mean, stdv of nominal attribute values   1101         1010         1010          54.14, 25.00, 64.28, 64.28
1..10 concentration histogram                      1101110000   0101000000   1010111111    78.57, 78.57, 46.42, 42.85, 50.00, 35.71, 28.57, 21.42, 21.42, 21.42
non computable conc. histogram                     0            0            1             21.42
1..10 concentration histogram with class           1001000000   0000000000   1101000000    53.57, 57.15, 3.57, 78.57, 3.57, 3.57, 3.57, 3.57, 0, 0
non computable conc. histogram with class          0            0            0             3.57
# continuous attributes                            1            0            0             50.00
1..10 correlation histogram                        1111111001   1111111111   1111111111    60.71, 89.28, 64.28, 50.00, 60.71, 75.00, 60.71, 53.57, 35.71, 53.57
non computable correlation histogram               0            0            1             14.28
1..10 missing values histogram                     0000000000   0000000000   0000000000    0, 0, 0, 0, 0, 0, 0, 0, 0, 0
# continuous / # attributes                        0            0            0             7.14
# nominal / # attributes                           0            0            1             14.28
Binary Attributes                                  0            1            1             32.14
Fract                                              1            0            0             28.57
Cancor                                             1            1            1             64.28
Mean Skew                                          1            1            1             50.00
Mean Kurtosis                                      1            1            1             64.28
Class Entropy                                      0            0            1             50.00
Mean Attributes Entropy                            1            1            0             78.57
Mean Mutual Information                            1            1            1             75.00
Equivalent number of attributes                    0            0            1             75.00
NoiseSignal Ratio                                  1            1            1             89.28
AttrMultiCorrel                                    1            1            1             67.85
SDratio                                            0            1            0             64.28

Based on the mainly empirical work presented here, our next steps will be
to get a better understanding of the specific conditions under which one inducer
is a better choice than another for a specific dataset. This could be achieved
through the use of a more sophisticated inducer on the meta-level that actually
constructs learned models. A thorough examination of these models will give
better insight into how data characteristics affect the relative performance of
the algorithms and will lead to a better understanding of the weak and strong
points of each inducer.

Acknowledgments. We would like to thank Johann Petrak for his Perl scripts 
and Joao Gama for Ltree and his implementation of linear discriminants. This 
work has been supported by Swiss OPES in the framework of ESPRIT IV LTR

Abstract. It is widely recognized that successful businesses usually
fall into set routines and become limited by their past. To remain
successful, they need to discover new opportunities and niches. Niches
are surprising rules that contradict the set routines; they capture
significant, representative client sectors that deserve new, more profitable
treatments; they are not merely strong-rule and exception pairs. In
this paper we study the efficient mining of set routines and niches. We
also introduce a semantic approach to select a set of representative
patterns, and present an efficient incremental algorithm to implement
the approach.

Keywords: Data mining, niches, set routines, exceptions, interesting- 
ness, semantic-based selection 



1 Introduction 

In order to succeed, a starting business is always looking for opportunities. An 
established business, however, usually falls into set routines and becomes lim- 
ited by its past. To remain successful, it needs to discover new opportunities and 
niches. Niches are surprising rules that contradict the set routines; they capture 
significant, representative client sectors that deserve new, targeted, more prof- 
itable treatments; they are not merely strong-rule and exception pairs. Niches 
are also useful for many other applications such as medicine, scientific discovery, 
and customized treatment of clients. 

We illustrate the importance of niches with an example reported in
KDNuggets. “Farmers Insurance found a previously unnoticed niche of sports
car enthusiasts: married boomers with a couple of kids and a second family car, 
maybe a minivan, parked in the driveway. Farmers relaxed its underwriting rules 
and cut rates on certain sports cars for people who fit the profile (and presumably 
gained market share in this niche).” An insurance company divides clients into 
different risk classes, and charges rates according to risk levels. Rules for decid- 
ing risk levels are formulated through experience, statistical analysis, or data 
mining; these rules may remain fixed for a long time and become set routines. 
A niche here can be a special segment of customers who are less risky than the 
company currently believes. 

To mine niches, we need to capture the set routines first. For a business
decision, the set routines (SRS) should correspond to a set of business rules or
operational policies; in general, the SRS should correspond to a set of dominant
trends (DTs) which are important to the task (decision) at hand. A DT can
be captured by an emerging pattern (EP), namely a pattern which occurs more
frequently in the undesirable instances than in the desirable instances, or vice
versa. The EP should occur at a relatively high support. This corresponds to
the fact that the SRS is usually formed by past experience, observations, and
even data analysis, because human observations and past data mining algorithms
mainly discover high support patterns.

The SRS should contain a relatively small number of DTs, such as 100 or
fewer. More importantly, the DTs should represent different segments of the
clients (instances); each DT should capture a unique segment of the instances.
We will first mine a set of EPs, using ConsEPMiner. Then we use a novel
semantic-based approach, which minimizes overlap and maximizes disjointness
between DTs, to select an SRS from this set. We associate each pattern with the
set of data instances containing the pattern, and we consider two patterns
semantically similar if their associated data sets overlap sufficiently.
Semantic-based selection ensures that different DTs capture almost disjoint
segments of data, and that the DTs in the SRS collectively cover as much data as
possible; it also helps enhance understandability of the niches and avoid
repeated computation.
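The intuition behind semantic-based selection can be sketched in Python as a greedy procedure over the instance sets of the candidate patterns. This is not the incremental algorithm of the paper, which is defined later; the overlap measure, the `max_overlap` threshold, the ordering by coverage and all names are assumptions chosen for illustration.

```python
def select_representatives(sat, max_overlap=0.2, max_patterns=100):
    """Greedily pick patterns that cover many, mostly new, instances.

    `sat` maps each pattern (here a string) to the set of instance ids
    containing it. A pattern is accepted when at most `max_overlap` of
    its instances are already covered by previously accepted patterns.
    """
    selected, covered = [], set()
    # Visit patterns from largest to smallest coverage (ties by name).
    for pattern in sorted(sat, key=lambda p: (-len(sat[p]), p)):
        instances = sat[pattern]
        if not instances:
            continue
        overlap = len(instances & covered) / len(instances)
        if overlap <= max_overlap:
            selected.append(pattern)
            covered |= instances
        if len(selected) == max_patterns:
            break
    return selected, covered


if __name__ == "__main__":
    sat = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6, 7}}
    print(select_representatives(sat))
```

In this toy input, pattern B is skipped because two of its three instances are already covered by A, while C adds an almost disjoint segment.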

The exception EPs to the SRS are mined, again using ConsEPMiner, and 
the semantic-based approach is used to select good representatives as niches. 
Experiments show that our algorithm is efficient, can find meaningful niches 
from real data, and can succeed in niche mining at low support levels. 

To illustrate how niche mining can be done, consider an auto insurance
company which has been in operation for a while. If an SRS cannot be extracted
from the business manual, we simply use the company’s database, using a
threshold on rates to divide the clients into $Risky^{view}$ and
$NonRisky^{view}$ (according to the view of the company). We extract an SRS
from $Risky^{view}$ and $NonRisky^{view}$. Then we divide the clients into
$Risky^{actual}$ and $NonRisky^{actual}$, using a threshold on the total number
(or amount) of claims. We can mine the exceptions to the SRS using these two
new data sets.

Niche mining can also be useful to a new insurance company, to help it
understand how its competitors work. If the new company has access to all the
data discussed above, then it can just use the procedure discussed above.
Otherwise, it can still gain insights on how its competitors work, by using
$Risky^{actual}$ and $NonRisky^{actual}$ for both SRS mining and niche mining.

Most past work on data mining concentrated on discovering high-support
patterns or high-support relationships. In contrast, our work is concerned with
finding low support exceptions to high support strong rules. Moreover, our niches
are exceptions to the set routines of an organization and thus are exceptions in a
global sense, and can thus provide the organization with “complete”
representatives of niche opportunities. In contrast, past research on exception
mining considered strong-rule and exception pairs in a local manner. Producing
all strong-rule and exception pairs will make the results hard to understand;
moreover, the mining process also becomes unnecessarily expensive, because
repeated similar computation is performed for similar strong rules. Other
researchers have considered mining of exceptions of other types, e.g.
interesting holes in data. Our work is also related to research on the
interestingness of patterns.

Our semantic-based selection of representative patterns is related to earlier
work that used the semantics-based idea of “cover” to select patterns, but
did not consider minimizing overlap and maximizing disjointness.

Our method can also be used to find semantic representatives of strong-rule 
and exception pairs. Structures can be added to the DTs in an SRS, e.g. by 
imposing an ordering; mining of niches for such extensions will be a future topic. 



2 Set Routines and Niches 

The way an organization operates can be understood from its past operations. 
We will consider one decision (e.g. is a client risky or not risky) for the organi- 
zation. For each instance, the decision will be either Yes or No. We will call each 
of the two decisions a class, and will denote these two classes as P (positive) and 
N (negative). For C in {P, N}, we will use NC to mean the opposite class of C. 
For convenience, we will identify a class with its associated set of instances. 

We assume the readers are familiar with transactions, relations, itemsets, and 
supports of itemsets. We will use the term instance to refer to case, transaction, 
vector, or tuple. We assume that numerical attributes have been discretized using 
some binning method such as equal-density or equal-length, and numerical values 
have been mapped to their containing intervals. For uniformity, we will also refer 
to these bins as values. Relations can now be viewed as transactions, where an 
item is an attribute-value pair. 
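The mapping from a relational tuple to a transaction of attribute-value items can be illustrated with a short Python sketch. Equal-length binning is one of the two methods mentioned above; the number of bins, the attribute names and the function names are illustrative assumptions.

```python
def equal_length_bin(value, low, high, n_bins):
    """Return the (start, end) interval of the equal-length bin holding value.

    Assumes low < high; values equal to `high` fall in the last bin.
    """
    width = (high - low) / n_bins
    index = min(int((value - low) / width), n_bins - 1)
    return (low + index * width, low + (index + 1) * width)


def to_transaction(row, numeric_ranges, n_bins=4):
    """Turn one tuple (a dict) into a set of attribute-value items.

    Numerical attributes listed in `numeric_ranges` (name -> (low, high))
    are replaced by their containing interval; nominal values are kept.
    """
    items = set()
    for attribute, value in row.items():
        if attribute in numeric_ranges:
            low, high = numeric_ranges[attribute]
            value = equal_length_bin(value, low, high, n_bins)
        items.add((attribute, value))
    return frozenset(items)


if __name__ == "__main__":
    row = {"gpa": 3.5, "sports_car_owner": "yes"}
    print(to_transaction(row, {"gpa": (0.0, 4.0)}))
```

After this step an itemset such as {("sports_car_owner", "yes")} can be tested against every instance with a plain subset check.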

For any organization, the policies regarding the decision are usually
formulated under the influence of dominant trends (DTs). There are several
requirements on a DT: (a) it should be a pattern (condition); (b) it should
capture a significant segment of the instances; (c) it should occur much more
frequently in one class than in the other class.

Requirement (a) ensures that the DT can be tested on instances; (b) and (c)
ensure that the DT has influenced the formulation of policies of the
organization, because it differentiates between the two classes over a
significant segment of instances.

For example, for insurance, statistics show that sports car owners are usually 
riskier than other owners. The condition involved in this pattern is “the owner 
owns sports cars”. This condition captures a significant segment of clients. 

An emerging pattern (EP) is a pattern meeting these requirements of a DT. 
EPs were introduced in earlier work to capture sharp differences between data 
classes or emerging trends in time. Suppose we are given two classes, say classes 
$C_1$ and $C_2$, associated with data sets $D_1$ and $D_2$, respectively. Let 
$\mathrm{supp}_i(X)$ denote $\mathrm{supp}_{D_i}(X)$. The growth rate of an itemset 
$X$ from $D_1$ to $D_2$ is defined as the ratio $\mathrm{supp}_2(X) / \mathrm{supp}_1(X)$ 
(letting $0/0 = 0$ and $s/0 = \infty$ for $s > 0$). Given a growth rate threshold 
$\rho > 1$, if the growth rate of $X$ from $D_1$ to $D_2$ is $\geq \rho$, then $X$ 
is called an emerging pattern from $D_1$ to $D_2$ (or from $C_1$ to $C_2$, or simply 
an EP of class $C_2$); $C_1$ is called the background class and $C_2$ the target 
class; we will write $X : C_2$ to signify the fact that $X$ is an EP of class $C_2$. 
We can use EPs to make decisions. 
If $X : C_2$ is an EP with a growth rate of 40, $|D_1| \approx |D_2|$, and $t$ is an instance 
containing $X$, then the probability that $t$ belongs to class $C_2$ is roughly 97% ($= 40/(40+1)$). 
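
The growth-rate definition and the probability estimate can be illustrated with a minimal Python sketch. The function names `growth_rate` and `target_probability` are ours, not from the paper, and the estimate GR/(GR+1) assumes, as in the example, that the two data sets have roughly the same size.

```python
import math

def growth_rate(supp_background: float, supp_target: float) -> float:
    """Growth rate of an itemset from the background class to the target class."""
    if supp_background == 0:
        # Conventions from the definition: 0/0 = 0 and s/0 = infinity for s > 0.
        return 0.0 if supp_target == 0 else math.inf
    return supp_target / supp_background

def target_probability(gr: float) -> float:
    """Estimated chance that an instance containing the EP is in the target class.

    Assumption: both classes contain about the same number of instances.
    """
    if math.isinf(gr):
        return 1.0
    return gr / (gr + 1.0)

print(round(target_probability(40), 3))  # 40/41, i.e. 0.976
```
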
The set routines (SRS) of an organization should consist of a set of DTs. 
They should satisfy these constraints in order to provide an accurate model: 

(a) The DTs in the SRS should collectively cover as many instances as possible. 

(b) Different DTs in the SRS should cover disjoint subsets of instances. 

(c) The SRS should be relatively small, containing perhaps < 100 DTs. 

For example, the SRS for a car insurance company should correspond to the 
way the rates are determined. One DT may be {sports-car-owner : yes} : Risky. 
Another DT may be {points-on-license > 3} : Risky. A third DT may be {3.0 < 
GPA < 4.0} : nonRisky. There can be more DTs in the SRS. 

We now formalize the meaning of cover, disjointness and overlap. For each 
pattern $X$, let $\mathrm{Sat}(X)$ denote the set of instances (of either class P or N) 
satisfying (containing or covered by) $X$. (A more refined approach is possible, by 
dividing $\mathrm{Sat}(X)$ into two classes and adjusting semantic-based selection 
accordingly. To focus on the spirit of the semantic-based approach, we omit the details 
of that refinement here.) For a set $S$ of patterns, let 
$\mathrm{Sat}(S) = \bigcup_{X \in S} \mathrm{Sat}(X)$. 
Difference and overlap between two patterns $X$ and $Y$ will be measured by 
$|(\mathrm{Sat}(X) - \mathrm{Sat}(Y)) \cup (\mathrm{Sat}(Y) - \mathrm{Sat}(X))|$ and 
$|\mathrm{Sat}(X) \cap \mathrm{Sat}(Y)|$, respectively. 

Example 1. Suppose our two classes together contain the instances of Table 1 
and we are given the four EPs in Table 2. Then $\mathrm{Sat}(\{1,2\}) = \{t_1, t_2, t_3, t_6\}$, 
$\mathrm{Sat}(\{2,3\}) = \{t_1, t_5\}$, and 
$\mathrm{Sat}(\{\{1,2\},\{2,3\}\}) = \{t_1, t_2, t_3, t_5, t_6\}$. The overlap 
between $\{1,2\}$ and $\{2,3\}$ is $\{t_1\}$ and the difference between them is 
$\{t_2, t_3, t_5, t_6\}$. 



Table 1. Transactions 

Transaction id   Items 
t1               {1,2,3} 
t2               {1,2,4} 
t3               {1,2} 
t4               {2,4,5} 
t5               {2,3,4} 
t6               {1,2} 
t7               {4,5} 
t8               {1,4,5} 

Table 2. EPs 

EP 
{1,2} 
{2,4} 
{2,3} 
{4,5} 
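
The definitions of Sat, overlap, and difference can be checked mechanically on the data of Table 1. The following Python sketch is ours (the helper names `sat`, `sat_of_set`, `overlap`, and `difference` are illustrative, not from the paper); it represents each transaction as a set of items.

```python
# Transactions of Table 1, keyed by transaction id.
TRANSACTIONS = {
    "t1": {1, 2, 3}, "t2": {1, 2, 4}, "t3": {1, 2}, "t4": {2, 4, 5},
    "t5": {2, 3, 4}, "t6": {1, 2}, "t7": {4, 5}, "t8": {1, 4, 5},
}

def sat(pattern, transactions=TRANSACTIONS):
    """Ids of the transactions that contain every item of the pattern."""
    return {tid for tid, items in transactions.items() if set(pattern) <= items}

def sat_of_set(patterns, transactions=TRANSACTIONS):
    """Sat(S): the union of Sat(X) over all patterns X in S."""
    covered = set()
    for pattern in patterns:
        covered |= sat(pattern, transactions)
    return covered

def overlap(x, y):
    """Instances covered by both patterns."""
    return sat(x) & sat(y)

def difference(x, y):
    """Instances covered by exactly one of the two patterns."""
    return sat(x) ^ sat(y)

print(sorted(sat({1, 2})))               # ['t1', 't2', 't3', 't6']
print(sorted(difference({1, 2}, {2, 3})))  # ['t2', 't3', 't5', 't6']
```
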



We now consider how to capture niches. Similar to dominant trends, niches 
should naturally be closely related to the decision under consideration; so we will 
also capture a niche by an EP. A niche should satisfy the following requirements: 
(a) A niche should be an exception to some DT in SRS. It should capture a subset 
of the instances captured by the DT and lead to a decision reversing that of the 
DT. (b) A niche should not be implied by other DTs of the SRS. We say that 
XY : C is implied by an EP Z : C if Z is a subset of XY. 

For the auto insurance example, the EP {sports-car-owner:yes, age:[40..60], 
married:yes, #kids > 2, second-family-car:yes}:NonRisky is a niche. It is an 
exception to the {sports-car-owner:yes}:Risky DT and it is not implied by any 
other DT in the SRS. However, the EP {age:[18..25], GPA:[3.0..4.0]}:NonRisky is 
not a niche: Although an exception to the DT {age:[18..25]}:Risky, it is implied 
by the EP {GPA:[3.0..4.0]}:NonRisky, which is in SRS. 

Footnote: Syntactical difference is not good for capturing SRS, because syntactically disjoint 
patterns may be semantically similar: {small-car-owner:yes} and {age:[18..25]} are 
syntactically different but may cover nearly the same segment of drivers. 
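
The two structural requirements on a niche (being an exception to some DT and not being implied by a DT of the SRS) can be sketched in Python as follows. This is only an illustration under our own encoding: patterns are frozensets of attribute:value strings paired with a class label, and the names `is_exception`, `is_implied`, and `is_niche` are hypothetical. The sketch does not check that the candidate is itself an EP, which the paper also requires.

```python
def is_exception(candidate, dt):
    """candidate refines the DT's condition and reverses its decision."""
    (c_items, c_class), (d_items, d_class) = candidate, dt
    return d_items < c_items and c_class != d_class

def is_implied(candidate, ep):
    """XY : C is implied by Z : C if Z is a subset of XY."""
    (c_items, c_class), (z_items, z_class) = candidate, ep
    return z_class == c_class and z_items <= c_items

def is_niche(candidate, srs):
    return (any(is_exception(candidate, dt) for dt in srs)
            and not any(is_implied(candidate, dt) for dt in srs))

# SRS and candidates taken from the auto insurance example.
srs = [
    (frozenset({"sports-car-owner:yes"}), "Risky"),
    (frozenset({"age:[18..25]"}), "Risky"),
    (frozenset({"GPA:[3.0..4.0]"}), "NonRisky"),
]
family_driver = (frozenset({"sports-car-owner:yes", "age:[40..60]", "married:yes",
                            "#kids>2", "second-family-car:yes"}), "NonRisky")
good_student = (frozenset({"age:[18..25]", "GPA:[3.0..4.0]"}), "NonRisky")

print(is_niche(family_driver, srs))  # True
print(is_niche(good_student, srs))   # False: implied by the GPA DT
```
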

It is sometimes possible to extract the SRS for an organization from its 
operational manual. If that is not possible, an SRS can be mined from operational 
data of the organization; we will address this problem in Section 4. 

3 The Niche Mining Problem and Our Algorithm 

The niche-mining problem is the following: Given two classes P and N, mine 
an SRS (if not given) and niches satisfying a minsupp threshold on DTs. 



Niche Miner (SRS S, classes P, N, minDTsupp) 

(1) if S is empty then 
(2)     mine sets E1 of EPs from P to N and E2 from N to P; 
(3)     select an SRS S from E1 and E2; 
(4) mine exceptions to DTs in S; 
(5) remove implied exceptions; 
(6) select representative exceptions as niches 



Fig. 1. Niche Miner Pseudo Code 



The pseudocode of our algorithm is given in Figure 1. We first mine EPs 
with growth rate greater than a threshold value (e.g. 5) and remove EPs whose 
growth rates are relatively low (e.g. not among the top 40%). We then select 
DTs that are semantically distinct in terms of their Sats. Finally we mine EPs 
contradicting the DTs and semantically select the niches. 

By selecting an SRS as a semantic representation of all possible dominant 
trends, we avoid the excessive computation needed over a huge number of possi- 
ble DTs. This allows us to mine at lower support thresholds for the niches. Also, 
the resulting niches will be more understandable. 

Mining EPs: We will use the previously published ConsEPMiner (Constraint-Based 
EP Miner) algorithm to mine EPs. The algorithm uses constraints, either explicitly 
given or inherently implied by the data/pattern type, to efficiently mine EPs from 
large high-dimensional data sets. It uses an improvement constraint to ensure that 
a representative grid of EPs (in the complete set-theoretic lattice of all EPs) is 
returned; the set of returned EPs is much smaller than the set of all possible EPs. 
More specifically, given that one EP X is chosen to be returned, another EP Y 
will not be returned if Y is a superset of X but the growth rate of Y is not 
significantly larger (as specified by the improvement constraint) than that of X. 
It also uses upper estimates of supports and growth rates, obtained from the counts 
of candidates (regular or lookahead) already considered, to prune candidates. 
Pruning and dynamic reordering of items are performed at three different stages: 
before counting, after the background data set is counted, and after both data sets 
are counted. Using these ideas, ConsEPMiner overcomes the problem of combinatorial 
explosion of candidate itemsets. 

Semantic-based Pattern Selection: In the next section we will propose efficient, 
semantic-based methods to select an SRS and select the representatives 
of exceptions as niches. 

Finding niches efficiently: Given an SRS, we need to mine exception EPs that 
contradict the SRS, and select representative exceptions as niches. To select 
representative exceptions as niches, we also use the SAT-EP-Select algorithm 
discussed in the next section. The exception EPs are required to satisfy some 
appropriate growth rate threshold. Observe that the DTs usually have very high 
growth rates, and large reverse growth rates indicate that the exception EPs are 
significant reverse “trends.” 

Our approach to mine exception EPs is: For each DT X we reduce the data 
sets to contain only the transactions containing X, and then we call ConsEP- 
Miner on the reduced data sets to find EPs that contradict X. This method 
is efficient since the reduced data sets are much smaller than the whole data 
sets. The above process of reducing data sets is called relativization; a similar 
technique was used in earlier work on instance-based classification. 
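
A minimal Python sketch of the relativization step, under our own assumptions: transactions are sets of items, the function name `relativize` is ours, and the two small data sets are made up purely for illustration. The subsequent call to an EP miner (ConsEPMiner in the paper) is not shown.

```python
def relativize(dataset, dt):
    """Keep only the transactions that contain every item of the DT."""
    dt = frozenset(dt)
    return [items for items in dataset if dt <= items]

# Made-up toy data for illustration only (not from the paper).
positive = [{1, 2, 3}, {2, 4}, {1, 2}]
negative = [{1, 2, 5}, {3, 4}]

dt = {1, 2}
reduced_positive = relativize(positive, dt)
reduced_negative = relativize(negative, dt)

# An EP miner would now be run on the two reduced data sets, with the class
# opposite to the DT's class as the target, to find EPs that contradict the DT.
print(reduced_positive, reduced_negative)
```

Because each reduced data set only contains transactions matching the DT, any pattern mined from it is implicitly a refinement of that DT.
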

A naive method to mine EPs is to call ConsEPMiner once for each class, and 
select niches from these common sets of EPs. This method suffers from several 
problems: Many of the mined EPs will be useless, since they do not contradict 
the EPs in SRS; moreover, to mine exceptions, we need to mine at very small 
threshold levels of support (such as 0.01%). At such low levels ConsEPMiner 
generates a huge number of EPs; it is very expensive (with respect to time) to 
find these EPs and to select from the set of candidate niches. One can improve 
this approach to make it competitive, by properly seeding ConsEPMiner with 
the DTs in the SRS. While this improved approach is not yet implemented, 
we believe that its performance might be comparable with our relativization 
approach. (The relativization approach may be better if the volume of data is 
large, as it becomes in memory for each DT after one pass for relativization.) 

From the support and growth rate of a DT X, one can determine the maximum 
support and growth rate for exception EPs of X. This information can 
help avoid useless computation: If no exception EP can meet the support 
and growth rate thresholds, then there is no need to call ConsEPMiner for X. 



4 Semantic-Based Pattern Selection 

When we mine a data set, we get many, perhaps millions of, EPs. The set of EPs 
can still number in tens of thousands, even after removing those with relatively 
low supports and growth rates. The objective of semantic-based selection is 
to select a representative subset of EPs satisfying: two different EPs capture 
disjoint sets of instances whereas the selected EPs collectively cover as many 
instances as possible. 

Footnote: While we present the semantics-based selection algorithm for EPs, it can be easily 
modified to work for other types of patterns. 

The exhaustive approach to selection is clearly infeasible, since the numbers 
of EPs and of transactions are both very large. In this paper we consider the 
greedy method sketched in Figure 2; we will give efficient algorithms for the key 
steps in Section 5. For steps 2 and 5, we use growth rate to break ties. In step 
5, maximizing the ratio of $|\mathrm{Sat}(X) - \mathrm{Sat}(S)|$ to 
$|\mathrm{Sat}(X) \cap \mathrm{Sat}(S)|$ allows us to maximize 
disjointness and minimize overlap. 

SAT-EP-Select(data set D, EP set V, K) 
;; K is the maximum number of selected EPs 
;; returns a set S of selected EPs 
1) compute Sat(X) for all X in V; 
2) select an EP X from V with highest support; 
3) let S = {X} and V = V - {X}; 
4) while (Sat(S) can be expanded) and (|S| < K) 
5)     select an EP X s.t. |Sat(X) - Sat(S)| / |Sat(X) ∩ Sat(S)| is maximal over all EPs in V; 
       ;; if there is an X in V s.t. Sat(X) ∩ Sat(S) = ∅, choose X 
6)     let S = S ∪ {X} and V = V - {X}; 
7) return S; 

Fig. 2. Sketch of SAT-EP-Select 

Example 2. We illustrate this algorithm using the transactions and EPs of 
Example 1. Let $K = 3$. Initially, $V = \{\{1,2\}, \{2,4\}, \{2,3\}, \{4,5\}\}$, and the Sats are: 
$\mathrm{Sat}(\{1,2\}) = \{t_1, t_2, t_3, t_6\}$, $\mathrm{Sat}(\{2,4\}) = \{t_2, t_4, t_5\}$, 
$\mathrm{Sat}(\{2,3\}) = \{t_1, t_5\}$, and $\mathrm{Sat}(\{4,5\}) = \{t_4, t_7, t_8\}$. 
We choose $\{1,2\}$ as our first DT, since it has highest 
support; now $S = \{\{1,2\}\}$ and $V = \{\{2,4\}, \{2,3\}, \{4,5\}\}$. 

For each iteration, we need to compute, for each EP $X$ in $V$, $\mathrm{Sat}(X) - \mathrm{Sat}(S)$ 
and $\mathrm{Sat}(X) \cap \mathrm{Sat}(S)$ in order to find their cardinalities. For iteration 1, we get 

$|\mathrm{Sat}(\{2,4\}) - \mathrm{Sat}(S)| = 2$, $|\mathrm{Sat}(\{2,4\}) \cap \mathrm{Sat}(S)| = 1$; 

$|\mathrm{Sat}(\{2,3\}) - \mathrm{Sat}(S)| = 1$, $|\mathrm{Sat}(\{2,3\}) \cap \mathrm{Sat}(S)| = 1$; 

$|\mathrm{Sat}(\{4,5\}) - \mathrm{Sat}(S)| = 3$, $|\mathrm{Sat}(\{4,5\}) \cap \mathrm{Sat}(S)| = 0$. 

We choose $X = \{4,5\}$, since it is the only EP whose Sat is disjoint from $\mathrm{Sat}(S)$. 
Now, $S = \{\{1,2\}\} \cup \{\{4,5\}\} = \{\{1,2\}, \{4,5\}\}$, and $V = \{\{2,4\}, \{2,3\}\}$. 

For iteration 2, we get $|\mathrm{Sat}(\{2,4\}) - \mathrm{Sat}(S)| = 1$, 
$|\mathrm{Sat}(\{2,4\}) \cap \mathrm{Sat}(S)| = 2$, $|\mathrm{Sat}(\{2,3\}) - \mathrm{Sat}(S)| = 1$, 
and $|\mathrm{Sat}(\{2,3\}) \cap \mathrm{Sat}(S)| = 1$. We choose 
$X = \{2,3\}$, since its ratio (1/1) exceeds that of $\{2,4\}$ (1/2). Since $|S| = K$ 
(and $\mathrm{Sat}(S) = \{t_1, t_2, t_3, t_4, t_5, t_6, t_7, t_8\}$ happens 
to be equal to the entire transaction set), we stop. So the selected EP set is 
$\{\{1,2\}, \{2,3\}, \{4,5\}\}$. 
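
The greedy selection walked through in this example can be sketched in Python as below. This is our own non-incremental illustration, not the authors' implementation: the function names and the optional `growth_rates` mapping are assumptions, the loop stops as soon as K EPs are selected or no remaining EP covers a new instance, and using the size of the uncovered part as a secondary tie-breaker before growth rate is our own choice.

```python
from math import inf

def sat(pattern, transactions):
    """Ids of the transactions containing every item of the pattern."""
    return {tid for tid, items in transactions.items() if pattern <= items}

def sat_ep_select(transactions, eps, k, growth_rates=None):
    growth_rates = growth_rates or {}
    sats = {ep: sat(ep, transactions) for ep in eps}
    remaining = list(eps)

    # Step 2: start with the EP of highest support (growth rate breaks ties).
    first = max(remaining, key=lambda ep: (len(sats[ep]), growth_rates.get(ep, 0.0)))
    selected, covered = [first], set(sats[first])
    remaining.remove(first)

    def score(ep):
        new = len(sats[ep] - covered)
        common = len(sats[ep] & covered)
        # An EP disjoint from Sat(S) gets an infinite ratio, so it is preferred.
        ratio = inf if common == 0 else new / common
        return (ratio, new, growth_rates.get(ep, 0.0))

    while len(selected) < k:
        # Only EPs that cover at least one new instance can expand Sat(S).
        candidates = [ep for ep in remaining if sats[ep] - covered]
        if not candidates:
            break
        best = max(candidates, key=score)
        selected.append(best)
        covered |= sats[best]
        remaining.remove(best)
    return selected

TRANSACTIONS = {
    "t1": {1, 2, 3}, "t2": {1, 2, 4}, "t3": {1, 2}, "t4": {2, 4, 5},
    "t5": {2, 3, 4}, "t6": {1, 2}, "t7": {4, 5}, "t8": {1, 4, 5},
}
EPS = [frozenset(p) for p in ({1, 2}, {2, 4}, {2, 3}, {4, 5})]

chosen = sat_ep_select(TRANSACTIONS, EPS, k=3)
print([sorted(ep) for ep in chosen])  # [[1, 2], [4, 5], [2, 3]]
```
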

The SAT-EP-Select algorithm as given above may select many more EPs 
for one class than the other. One can avoid this as follows: We choose EPs by 
switching between the two classes; at any time, if one class C is overrepresented, 
then we select the next EP from the class NC (unless NC is exhausted already). 

There are two expensive steps in this algorithm. Notice that V is usually 
a very large set, containing tens (or even hundreds) of thousands of EPs; the 
data set is also very large, containing tens of thousands of transactions; and the 
data may have very high dimension. Step 1, which computes the initial Sats of 
all EPs in V, can be expensive, since it must check which transactions contain 
which EPs. We implemented this search using a hash tree, obtaining a speedup 
of about 50%; the details are omitted. Step 5, which calculates $\mathrm{Sat}(X) - \mathrm{Sat}(S)$ 
and $\mathrm{Sat}(X) \cap \mathrm{Sat}(S)$ for each X in V at each iteration, can be very expensive, 
since Sat(X) can be as large as the number of transactions. We will discuss an 
incremental approach to reduce the cost next. 

5 Incremental Computation of Overlap and Differences 

The main idea of our technique is to avoid the repeated computations, between 
consecutive iterations, in computing $|\mathrm{Sat}(X) - \mathrm{Sat}(S)|$ and 
$|\mathrm{Sat}(X) \cap \mathrm{Sat}(S)|$ for each X in V. To this end we store the set 
$\mathrm{Sat}(X) - \mathrm{Sat}(S)$ in a variable 
called SatDif(X). SatDif(X) is initialized to Sat(X) (for S = {}). 

Suppose the EP chosen for a particular iteration is Y, and the set S before 
Y is added to it is $S_0$. Let $S_Y = S_0 \cup \{Y\}$. Let $\mathrm{SatDif}(X, S_0)$ denote the value 
of SatDif(X) before Y is added to S, and let $\mathrm{SatDif}(X, S_Y)$ denote the value 
of SatDif(X) after Y is added to S. 

We observe this: A transaction t in $\mathrm{SatDif}(X, S_0)$ will need to be removed 
to get $\mathrm{SatDif}(X, S_Y)$ iff t is in $\mathrm{SatDif}(Y, S_0)$. This observation can 
be used as follows: In the incremental computation we can use the transactions 
in $\mathrm{SatDif}(Y, S_0)$ to drive the computation of $\mathrm{SatDif}(X, S_Y)$. This will improve 
efficiency because $\mathrm{SatDif}(Y, S_0)$ is normally small, especially after a number of 
iterations have been executed. This idea is formalized in the algorithm below: 

For each t in SatDif(Y) 
    For each X in V such that t is in SatDif(X, S_0) 
        remove t from SatDif(X); 
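
A set-based Python sketch of this incremental update follows. It is an illustration under our own assumptions: SatDif is kept as a dictionary from each candidate EP (here a tuple of items) to a set of transaction ids, and the function name `add_to_selection` is ours. The initial contents are the Sats of the four EPs used in the examples.

```python
def add_to_selection(y, sat_dif):
    """Update SatDif(X) for every remaining EP X after EP y is selected.

    The transactions in SatDif(Y) drive the update: each one is removed
    from the SatDif of every other candidate that still contains it.
    """
    driver = sat_dif.pop(y)  # SatDif(Y, S_0); y leaves the candidate set
    for t in driver:
        for remaining in sat_dif.values():
            remaining.discard(t)  # no-op unless t is in SatDif(X, S_0)
    return driver

# SatDif(X) is initialized to Sat(X) while S is empty.
sat_dif = {
    (1, 2): {"t1", "t2", "t3", "t6"},
    (2, 4): {"t2", "t4", "t5"},
    (2, 3): {"t1", "t5"},
    (4, 5): {"t4", "t7", "t8"},
}

add_to_selection((1, 2), sat_dif)
after_first = {ep: set(tids) for ep, tids in sat_dif.items()}

add_to_selection((4, 5), sat_dif)
after_second = {ep: set(tids) for ep, tids in sat_dif.items()}

print(after_first)
print(after_second)
```

Note that the second call only iterates over the three transactions still attributed to {4,5}, rather than over all seven transactions covered by S.
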

We now illustrate how the incremental approach is more efficient using the 
data and EPs of Example 1. The computation with the incremental approach 
differs from the original naive approach in that we replace the Sat of individual 
EPs by their SatDif, and that we replace Sat(S) by SatDif(Y) for deriving 
SatDif(X). More specifically, the computation in iteration 1 is identical to that 
for the non-incremental approach, as SatDif(X) = Sat(X) for all X at this 
time. In iteration 2, this is no longer the case; for example, Sat({4,5}) is 
replaced by SatDif({4,5}). The computation of Sat({2,4}) - Sat(S) is replaced 
by SatDif({2,4}) - SatDif(Y) (for Y = {4,5}); such replacement is the main 
reason that the incremental approach is efficient, because 
$\mathrm{SatDif}(Y) = \{t_4, t_7, t_8\}$ 
is much smaller than Sat(S), and because SatDif(Y) is used to drive the 
modification of SatDif of all EPs. For example, SatDif({2,4}) is computed through 
$\{t_4, t_5\} - \{t_4, t_7, t_8\}$ instead of 
$\{t_2, t_4, t_5\} - \{t_1, t_2, t_3, t_4, t_6, t_7, t_8\}$. 

We will store the set SatDif(X) as a bit vector, and the SatDifs of all EPs 
in V as a bit matrix, with EPs as columns and transactions as rows. Initially, 
position (t, e) of the matrix is 1 iff transaction t contains EP e. Table 3 below 
shows the initial contents of this matrix for Example 1. 

To compute the cardinalities of the differences and the overlaps, we 
will keep two additional arrays of integers: OldCounts and CurCounts. 
OldCounts(X) stores $|\mathrm{Sat}(X)|$ whereas CurCounts(X) stores 
$|\mathrm{Sat}(X) - \mathrm{Sat}(S)| = |\mathrm{SatDif}(X)|$. We can derive 
$|\mathrm{Sat}(X) \cap \mathrm{Sat}(S)|$ from OldCounts(X) - CurCounts(X). 
The OldCounts array is initialized but never changed. 

CurCounts(X) is adjusted only when we modify SatDif(X). Suppose Y 
has just been selected. For each t in SatDif(Y), if t is in SatDif(X), then we 
remove t from SatDif(X) and decrement CurCounts(X) by 1. 

The algorithm, presented in terms of the SatDif matrix, is given in Figure 3. 



;; Suppose Y is chosen as a new DT. 
For each transaction t such that the (t,Y) position of the matrix = 1 
    ;; transaction t is contained in SatDif(Y) 
    For each EP X ≠ Y 
        If the (t,X) position of the matrix = 1 then 
            change the value to 0 and decrement CurCounts for X; 
            ;; remove t from SatDif(X) and adjust CurCounts(X) 



Fig. 3. Sketch: Bit-Driven SAT-EP-Select 



We illustrate the algorithm for Y = {1,2}. Suppose the matrix before Y is 
added is given in Table 3. Because t1 is a member of SatDif(Y), we check if 
there is any other 1 in the same row; we find that there is a 1 in the column for 
{2,3}; we change it to 0 and decrement CurCounts({2,3}) by 1. Similar actions 
are taken for t2, t3, and t6. After adding Y to S, the matrix and the CurCounts 
array become Table 4. 
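
The matrix-driven update can be sketched in Python as follows. This is a simplified illustration of ours: the matrix is a list of 0/1 rows rather than packed bit vectors, EPs are identified by column index, and the names `build_matrix` and `select` are hypothetical. Running it for Y = {1,2} gives the CurCounts described for the second iteration.

```python
TRANSACTIONS = [{1, 2, 3}, {1, 2, 4}, {1, 2}, {2, 4, 5},
                {2, 3, 4}, {1, 2}, {4, 5}, {1, 4, 5}]   # t1 .. t8
EPS = [(1, 2), (2, 4), (2, 3), (4, 5)]                  # one column per EP

def build_matrix(transactions, eps):
    """matrix[t][e] is 1 iff transaction t contains EP e."""
    return [[1 if set(ep) <= items else 0 for ep in eps] for items in transactions]

def select(y, matrix, cur_counts):
    """Record that the EP in column y was selected as a new DT."""
    for row in matrix:
        if row[y] == 1:                  # this transaction is in SatDif(Y)
            for x in range(len(row)):
                if x != y and row[x] == 1:
                    row[x] = 0           # remove the transaction from SatDif(X)
                    cur_counts[x] -= 1

matrix = build_matrix(TRANSACTIONS, EPS)
old_counts = [sum(column) for column in zip(*matrix)]   # |Sat(X)|, never changed
cur_counts = list(old_counts)                           # |SatDif(X)|

select(0, matrix, cur_counts)                           # Y = {1,2}
overlaps = [old - cur for old, cur in zip(old_counts, cur_counts)]

print(cur_counts)  # [4, 2, 1, 3]
print(overlaps)    # [0, 1, 1, 0]
```
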

Table 3. SatDif matrix for Example 1 

Trans/EPs    {1,2}   {2,4}   {2,3}   {4,5} 
t1             1       0       1       0 
t2             1       1       0       0 
t3             1       0       0       0 
t4             0       1       0       1 
t5             0       1       1       0 
t6             1       0       0       0 
t7             0       0       0       1 
t8             0       0       0       1 
CurCounts      4       3       2       3 
OldCounts      4       3       2       3 

Table 4. SatDif matrix for second iteration 

Trans/EPs    {1,2}   {2,4}   {2,3}   {4,5} 
t1             1       0       0       0 
t2             1       0       0       0 
t3             1       0       0       0 
t4             0       1       0       1 
t5             0       1       1       0 
t6             1       0       0       0 
t7             0       0       0       1 
t8             0       0       0       1 
CurCounts      4       2       1       3 
OldCounts      4       3       2       3 



Figure 4 compares the bit-driven incremental approach against the original 
naive approach, for waveform data with 10000 transactions, 21 attributes, and 
14623 EPs. For K = 100, it took 2433 seconds for the naive approach 
but only 86 seconds for the incremental approach. Both approaches use hashing for 
the initial Sat computation. 

[Figure 4: plot with two series, "Naive method" and "Bit Driven approach".] 

Fig. 4. Cost of SRS Selection 



6 Experiments 

We report results of two types of experiments: (i) results of mining SRS and 
niches from real data sets (mostly from the UCI Repository), (ii) results on 
efficiency and scalability with respect to both volume and dimensions. All ex- 
periments, including those reported in previous sections, are carried out on a 
single node of a multiprocessor system with 195 MHz ip27 processors and a 
shared main memory of 3698 MB. 

SRS and Niches. Our results indicate that our algorithm can find meaningful 
niches from real data, at low support thresholds. 

We report results on the Adult data set, which has 14 attributes (6 continuous 
and 8 nominal) and 32561 = 24720 + 7841 instances. There are two classes 
defined: 1) people earning <=50K (24720 instances) and 2) people earning >50K 
(7841 instances). The instances contain personal information (US Census), 
including these attributes: Age, Workclass, fnlwgt (the final sampling weight), 
Education, Education-num, Marital-status, Occupation, Relationship, Race, Sex, 
Capital-gain, Capital-loss, Hours-per-week, and Native-country; the data set was 
originally used to predict yearly salaries. 

Footnote: Experiments on other data sets, including Musk (40 selected dimensions, 6598 = 
5581 + 1017 instances), Waveform (21 dimensions, 5000 = 1657 + 3343 instances), 
and Arabidopsis-DNA (35 dimensions, 44484 = 2305 + 42179 instances), are not 
included due to space restrictions. 

We found the SRS given in Table 5. We first used the minsupp = 0.02 
threshold, a growth rate threshold of 5, and a growth rate improvement threshold 
of 0.05. We found around 500 EPs for each class. We then selected the top 300 EPs 
for each class. (We found that the Adult data set has very few EPs, unlike the 
other data sets mentioned above.) Then we selected an SRS from these EPs. 

Footnote: In the tables we will omit the attributes if such omissions will not lead to confusion. 

Table 5. SRS for Adult 

DTs                                                                            Supp   GR      Class 
age:[-..31.6), Never-married, cg20                                             0.31   18.23   <= 50K 
fwt30, edu-num:[13..+), relnship: Husband, White, Male                         0.25   25.83   > 50K 
edu-num:[10..13), Married-civ-s, occup:Exec-mngerial, White, HPW:[40.2..59.8)  0.04   17.39   > 50K 
age:[31.6..46.2), edu: Prof-school                                             0.03   17.70   > 50K 
edu-num:[7..10), Divorced, cg20, HPW:[20.6..40.2), nativ-cntry: USA            0.14   17.48   <= 50K 
age:[31.6..46.2), fwt30, edu: Masters, occup: Exec-managerial, Male            0.02   18.45   > 50K 
wrkclss: Private, Separated, cg20, cap-loss:[-..871.2), HPW:[20.6..40.2)       0.04   22.92   <= 50K 

cg20 denotes cap-gain:[-..19999.8), fwt30 denotes fnlwgt:[-..306769) 



We mined exception EPs and selected niches from them. For the first DT 
given above, we found 18 niches, whose supports range from 0.76% down to 
0.03%. We list three of these below. (We omit the conditions in the DT.) 

Table 6. Niches for First DT of Table 5 

Niches                                                          Supp     GR       Class 
edu: Bachelors, relnship: Not-in-family                         0.0076   674.00   > 50K 
edu-num:[13..+), White                                          0.0018   153.00   > 50K 
relnship: Not-in-family, Amer-Indian-Eskimo, HPW:[20.6..40.2)   0.0003   ∞        > 50K 



We note that, for real-life data sets, niches with support as low as 0.01% can 
still be very useful: they may capture a segment of thousands of customers. 

Scalability with respect to volume. To study how our algorithm performs 
when the size of the data set increases, we chose the Adult data set, since it has 
a large number of instances (32561 instances, 14 attributes). We randomly 
selected sub-data sets of 15K, 20K, 25K, and 30K instances. We set the 
maximum number of DTs to 50. The graph in Figure 5a shows that our 
algorithm is scalable w.r.t. volume, and can find niches from very large data sets. 



[Figure 5: two panels, (a) execution time vs. volume and (b) execution time vs. dimensionality; curves: Finding Niches, SRS: ConsEPMiner, SRS: SAT-EP-Select.] 

Fig. 5. Execution Time vs. Volume and Dimensionality 



The three curves show the time needed by different components of the 
algorithm. “SRS:ConsEPMiner” corresponds to the mining of the initial EPs for the 
selection of the SRS, “SRS:SAT-EP-Select” corresponds to the selection of the 
DTs from the initial set of EPs, and “Finding Niches” corresponds to finding 
the niches (including the repeated calls to the relativization, ConsEPMiner, and 
SAT-EP-Select procedures). 

Scalability with respect to Dimensionality. To study how our algorithm 
performs with respect to dimensionality, we generated appropriate data sets by 
mutating the Musk data set as follows. The original Musk data has 166 attributes 
and about 6598 = 5581 + 1017 instances. The dimensionality is high. For any 
given number N, we randomly choose N attributes of Musk. 

However, the number of instances in Musk is small. To increase the number 
of instances, we add instances. We do not want to add exact copies of existing 
instances and we do not want to add instances which are totally different from 
the existing instances. Our solution is to add “mutated copies” of existing 
instances: We randomly choose an existing instance, then randomly choose 
some of its attributes, and then change each chosen attribute randomly within a 
range of ±20% of its original value. For each instance, a maximum of 25% of the 
total number of attributes are modified. These steps ensure that the mutated 
instances are similar to existing ones but not identical. 
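
The mutation procedure can be sketched in Python as follows. This is a hedged illustration, not the authors' generator: the function names, the use of a seeded `random.Random`, the choice to always mutate at least one attribute, and the choice to copy only from the original instances are our assumptions. Attributes are assumed to be numeric; an attribute whose value is 0 stays unchanged under a relative perturbation.

```python
import random

def mutated_copy(instance, rng, max_fraction=0.25, max_change=0.20):
    """Return a copy of the instance with a few attributes perturbed."""
    copy = list(instance)
    limit = int(len(copy) * max_fraction)   # at most 25% of the attributes
    if limit == 0:
        return copy                         # too few attributes to mutate
    how_many = rng.randint(1, limit)
    for i in rng.sample(range(len(copy)), how_many):
        # Change the value by a random amount within +/-20% of the original.
        copy[i] *= 1.0 + rng.uniform(-max_change, max_change)
    return copy

def enlarge(instances, target_size, seed=0):
    """Grow the data set to target_size by adding mutated copies."""
    rng = random.Random(seed)
    enlarged = [list(inst) for inst in instances]
    while len(enlarged) < target_size:
        enlarged.append(mutated_copy(rng.choice(instances), rng))
    return enlarged

# Tiny made-up data set: 3 identical instances with 20 numeric attributes.
data = [[float(j + 1) for j in range(20)] for _ in range(3)]
bigger = enlarge(data, target_size=10, seed=1)
print(len(bigger))  # 10
```
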

The graph in Figure 5b shows the performance of our algorithm w.r.t. the number 
of dimensions: it is fast and can efficiently deal with high dimensions. 



7 Concluding Remarks 



In this paper we proposed a way to capture set routines (SRS) and niches, and 
introduced algorithms to efficiently mine SRS and niches. SRS allows one to 
understand in a global sense how an organization operates with respect to a 
decision. By mining niches together with an SRS, we ensure that the niches can 
indeed provide appropriate representatives of all possible new opportunities, and 
that they are informative and more understandable. 

The semantic-based selection algorithm introduced here is useful for selecting 
a good set of representatives of patterns. The approach ensures that different 
selected patterns capture different aspects of the application (equivalently, differ- 
ent segments of data), and collectively they capture as many aspects as possible. 

Algorithmically, our niche and SRS mining algorithm is efficient. An impor- 
tant reason is the selection of an SRS before mining exceptions, because this 
helps avoid the repeated computation for similar dominant patterns. The bit- 
driven SAT-EP-Select algorithm is also efficient. We also used a hash technique 
and a relativization technique to improve the efficiency of niche mining. 

For future research, the following problems can be considered: How to push 
the semantic-based selection into a tree-based pattern mining algorithm? How 
to use niches to improve prediction accuracy in the classification process? It is 
also interesting to generalize our SAT-EP-Select algorithm, by considering it as 
a clustering problem for extremely high dimensions. 
Abstract. When mining a large database, the number of patterns discovered 
can easily exceed the capabilities of a human user to identify interesting 
results. To address this problem, various techniques have been 
suggested to reduce and/or order the patterns prior to presenting them 
to the user. In this paper, our focus is on ranking summaries generated 
from a single dataset, where attributes can be generalized in many dif- 
ferent ways and to many levels of granularity according to taxonomic 
hierarchies. We theoretically and empirically evaluate thirteen diversity 
measures used as heuristic measures of interestingness for ranking sum- 
maries generated from databases. The thirteen diversity measures have 
previously been utilized in various disciplines, such as information the- 
ory, statistics, ecology, and economics. We describe five principles that 
any measure must satisfy to be considered useful for ranking summaries. 
Theoretical results show that only four of the thirteen diversity measures 
satisfy all of the principles. We then analyze the distribution of the index 
values generated by each of the thirteen diversity measures. Empirical re- 
sults, obtained using synthetic data, show that the distribution of index 
values generated tends to be highly skewed about the mean, median, and 
middle index values. The objective of this work is to gain some insight 
into the behaviour that can be expected from each of the measures in 
practice. 



1 Introduction 

When mining a large database, the number of patterns discovered can easily ex- 
ceed the capabilities of a human user to identify interesting results. To address 
this problem, various techniques have been suggested to reduce and/or order the 
patterns prior to presenting them to the user. For example, earlier work has shown 
that the most interesting rules may reside along a support/confidence border. One 
technique discovers interesting rules via an interactive 
process that seeks to classify rules that are not interesting. Another measure 
determines the interestingness (called surprise there) of discovered 
knowledge via the explicit detection of Simpson’s Paradox. Yet another approach 
utilizes a distance metric to evaluate the importance of a rule 
by considering its unexpectedness in terms of other rules in its neighborhood. 

Our focus is on the use of diversity measures for ranking summaries generated 
from a single dataset, where attributes can be generalized in many different 
ways and to many levels of granularity according to taxonomic hierarchies. We 
introduced this use of diversity measures in earlier work. An empirical analysis 
found that highly ranked, concise summaries provided a reasonable starting point 
for further analysis of discovered knowledge. It was also shown that for selected 
sample datasets, the order in which some of the measures rank summaries is 
highly correlated, but the rank ordering can vary substantially when different 
measures are used. Elsewhere, the notion of a summary was extended to include other 
forms of knowledge representation, and we showed that these other forms are 
also amenable to ranking using diversity measures. Significant progress has also 
been made on more theoretical issues regarding formal principles for diversity 
measures used as measures of interestingness in data mining applications. 

In this paper, we evaluate thirteen diversity measures as heuristic measures of 
interestingness for ranking summaries in data mining applications. We describe 
five principles that any measure must satisfy to be considered useful for ranking 
summaries. Our theoretical results show that only four of the thirteen diversity 
measures satisfy all of the principles. We then analyze the distribution of the 
index values generated by each of the thirteen diversity measures. Empirical 
results, obtained using synthetic data, show that the distribution of index values 
generated tends to be highly skewed about the mean, median, and middle index 
values. The objective of this work is to gain some insight into the behaviour that 
can be expected from each of the measures in practice so that when choosing a 
candidate interestingness measure, we can determine which of the five principles 
are satisfied, and then knowing the behavioural characteristics of each measure, 
judge the suitability of the candidate interestingness measure for the intended 
application. 

The remainder of the paper is organized as follows. In Section 2, we describe 
several forms of knowledge representation, which we collectively refer to as 
summaries, and motivate the need for ranking discovered knowledge. In Section 3, 
we provide a brief overview of thirteen diversity measures introduced and 
evaluated as heuristic measures of interestingness in previous work. In Section 4, we 
describe five principles that useful diversity measures must satisfy, and identify 
those diversity measures satisfying the five principles. In Section 5, we present 
experimental results describing the distribution of index values generated by 
each of the thirteen measures. We conclude in Section 6 with a summary of our 
work and suggestions for future research. 

2 Background and Motivation 

Let a summary $S$ be a relation defined on the columns 
$\{(A_1, D_1), (A_2, D_2), \ldots, (A_n, D_n)\}$, where each $(A_i, D_i)$ is an 
attribute-domain pair. Also, let 
$\{(A_1, v_{i1}), (A_2, v_{i2}), \ldots, (A_n, v_{in})\}$, $i = 1, 2, \ldots, m$, be a set of 
$m$ unique tuples, where each $(A_j, v_{ij})$ is an attribute-value pair and each 
$v_{ij}$ is a value from the domain $D_j$ associated with attribute $A_j$. Let attribute 
$A_k$ be a derived attribute whose value $v_{ik}$, from the domain $D_k$, in each 
attribute-value pair $(A_k, v_{ik})$ is an aggregation of values from the 
unconditioned data present in the original database. For example, a sample summary 
is shown in Table 1. Table 1 is 
a generalized relation in which retail sales transactions have been aggregated 
to show the derived attributes Quantity, Amount, and Count (i.e., number of 
transactions) by Region. 



Table 1. A generalized relation

Region   Quantity   Amount    Count
North    12         $150.00   7
South    5          $325.00   2
West     8          $200.00   4
East     11         $275.00   3



The summary definition given above can also be naturally extended to include
summaries that are multi-dimensional. For example, another sample summary is
shown in Figure 1. Figure 1 shows a data cube in which retail sales transactions
have been aggregated in three dimensions, where the Item attribute is on the
vertical dimension, Transact.Loc is on the horizontal, and Cust.Loc is on the
diagonal. Transact.Loc is the city where the sales transaction was processed,
and Cust.Loc is the city where the sales transaction was initiated. Here we show
each cell containing two values (due to space limitations); the top value is the
quantity of items aggregated from sales transactions (i.e., Quantity), and the
bottom value is the number of transactions aggregated (i.e., Count).





Fig. 1. A data cube



Of course, numerous methods could be used to guide the generation of
summaries, such as concept hierarchies, domain generalization graphs, Galois
lattices, conceptual graphs, and formal concept analysis. Also,
summaries could more generally include many other forms of knowledge
representation, such as database views, association rules, itemsets, and web
search results.

However, when given hundreds, or even thousands of summaries (possibly 
multi-dimensional), it is simply not feasible to determine the most interesting 
summaries or dimensions using a manual technique. What is needed are effective 
measures of interestingness to assist in the interpretation and evaluation of the 
discovered knowledge. The development of such measures is currently an active 
research area in KDD. Such measures are broadly classified as either objective 
or subjective. Objective measures are based upon the structure of discovered
patterns, such as the frequency with which combinations of items appear in
sales transactions. Subjective measures are based upon user beliefs or biases
regarding relationships in the data, such as an approach utilizing Bayes' rule to
revise prior beliefs. Here we focus on objective measures of interestingness.

3 Objective Interestingness Measures

The tuples in a summary or dimension generated from a database are unique,
and therefore can be considered to be a population with a structure that can
be described by some frequency or probability distribution. Here, we review
thirteen diversity measures, described in detail elsewhere and shown in
Figure 2, that evaluate the frequency or probability distribution of the values
in a derived attribute to assign a single real-valued index that represents its
interestingness relative to other summaries or dimensions generated from the
same database.

In Figure 2, let $m$ be the total number of tuples in a summary. Let $n_i$ be
the value contained in the derived attribute for tuple $t_i$. Let
$N = \sum_{i=1}^{m} n_i$ be the total of the derived attribute. Let $p$ be the
actual probability distribution of the tuples based upon the values $n_i$. Let
$p_i = n_i / N$ be the actual probability for tuple $t_i$. Let $q$ be a uniform
probability distribution of the tuples. Let $\bar{u} = N/m$ be the value for
tuple $t_i$, $i = 1, 2, \ldots, m$, according to the uniform distribution $q$.
Let $\bar{q} = 1/m$ be the probability for tuple $t_i$, for all
$i = 1, 2, \ldots, m$, according to the uniform distribution $q$. Let $r$ be the
probability distribution obtained by combining the values $n_i$ and $\bar{u}$.
Let $r_i = (n_i + \bar{u}) / 2N$ be the probability for tuple $t_i$, for all
$i = 1, 2, \ldots, m$, according to the distribution $r$.
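The definitions above can be made concrete with a short sketch. Figure 2 is not reproduced in this text, so the five formulas below are the standard textbook forms of these indices and should be read as assumptions rather than as the exact contents of the figure; the function names are hypothetical. As a sanity check, for the most skewed vector of 50 objects among 10 classes, (41, 1, ..., 1), they give the extreme values 0.064, 0.676, 1.250664, 0.82, and 0.72 listed for these measures in Table 5.

```python
import math


def probabilities(counts):
    """Return p_i = n_i / N for a vector of non-zero counts."""
    total = sum(counts)
    return [n / total for n in counts]


def i_variance(counts):
    """Variance of p_i about the uniform probability q_bar = 1/m (assumed form)."""
    p = probabilities(counts)
    m = len(p)
    q_bar = 1.0 / m
    return sum((p_i - q_bar) ** 2 for p_i in p) / (m - 1)


def i_simpson(counts):
    """Simpson concentration: sum of squared probabilities (assumed form)."""
    return sum(p_i ** 2 for p_i in probabilities(counts))


def i_shannon(counts):
    """Shannon entropy in bits (assumed form); largest for uniform counts."""
    return -sum(p_i * math.log2(p_i) for p_i in probabilities(counts))


def i_berger(counts):
    """Berger dominance: the largest probability (assumed form)."""
    return max(probabilities(counts))


def i_schutz(counts):
    """Schutz inequality: relative mean deviation from q_bar (assumed form)."""
    p = probabilities(counts)
    m = len(p)
    q_bar = 1.0 / m
    return sum(abs(p_i - q_bar) for p_i in p) / (2 * m * q_bar)


most_skewed = [41] + [1] * 9   # 50 objects among 10 classes
uniform = [5] * 10
for measure in (i_variance, i_simpson, i_shannon, i_berger, i_schutz):
    print(measure.__name__, measure(most_skewed), measure(uniform))
```

Each function takes the raw values $n_i$ of the derived attribute, so the same vector can be passed to every measure and the resulting index values compared.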

The measures shown in Figure 2 are well-known measures of dispersion,
dominance, inequality, and concentration that have previously been successfully
applied in several areas of the social, ecological, information, and computer
sciences. Although the terminology varies depending upon the application, the
concept of diversity has been considered a useful one for analyzing many
phenomena. For example, in ecology, various measures of diversity have been
proposed and studied to aid in understanding the variability of populations of
organisms within different types of habitat. Diversity measures have also been
used by economists and social scientists to study the distribution of income
between different socioeconomic groups and geographical regions. In information
theory, diversity measures are used to measure the information content in
messages. Diversity measures have been used to describe the linguistic
differences between the inhabitants of neighboring geographic regions. More
general treatments attempt to define the concept of diversity and develop a
related theory of diversity measurement.

Fig. 2. Thirteen diversity measures

4 Theoretical Results 

We now describe principles of interestingness against which the utility of
candidate interestingness measures can be assessed. We do this through the
mathematical formulation of five principles that must be satisfied by any
acceptable diversity measure for ranking the interestingness of discovered
knowledge using our, or a similar, technique. Proofs are omitted due to space
considerations; complete details are given in the cited references. We study
functions $f$ of $m$ variables, $f(n_1, \ldots, n_m)$, where $f$ denotes a
general measure of diversity, $m$ and each $n_i$ (assumed to be non-zero) are
as defined in the previous section, and $(n_1, \ldots, n_m)$ is a vector
corresponding to the values in a derived numeric measure attribute (e.g., the
Count values from the examples in Section 2) for some arbitrary summary whose
values are arranged in descending order such that $n_1 \geq \ldots \geq n_m$
(except for discussions regarding I_Lorenz, which requires that the values be
arranged in ascending order). The principles presented here are for ranking the
interestingness of summaries generated from a single dataset, so we assume that
$N$ is fixed. We justify the non-zero assumption for the $n_i$'s as follows. If
the value of the Count attribute for a particular tuple is zero, there are two
possible reasons. Either the combination of domain values being counted in the
tuple can occur in practice, but no occurrences have been encountered during the
mining process, or else the combination of domain values being summarized cannot
occur in practice, and no occurrences will ever be encountered (i.e., such an
entity does not exist). So, to preserve and simplify the general applicability of
our technique, we make no assumptions regarding the possibility of occurrence
of particular combinations of domain values. We now begin by specifying two
fundamental principles.

Minimum Value Principle (P1). Given a vector $(n_1, \ldots, n_m)$, where
$n_i = n_j$ for all $i, j$ with $i \neq j$, $f(n_1, \ldots, n_m)$ attains its
minimum value.

P1 specifies that the minimum interestingness should be attained when the
tuple counts are all equal (i.e., uniformly distributed). For example, given the
vectors (2, 2), (50, 50, 50), and (1000, 1000, 1000, 1000), we require that the
index value generated by $f$ be the minimum possible for the respective values
of $m$ and $N$.

Maximum Value Principle (P2). Given a vector $(n_1, \ldots, n_m)$, where
$n_1 = N - m + 1$, $n_i = 1$, $i = 2, \ldots, m$, and $N > m$,
$f(n_1, \ldots, n_m)$ attains its maximum value.

P2 specifies that the maximum interestingness should be attained when the
tuple counts are distributed as unevenly as possible. For example, given the
vectors (3, 1), (148, 1, 1), and (3997, 1, 1, 1), where $m = 2$, 3, and 4,
respectively, and $N = 4$, 150, and 4000, respectively, we require that the
index value generated by $f$ be the maximum possible for the respective values
of $m$ and $N$.

The behaviour of a measure relative to satisfying both P1 and P2 is significant
because it reveals an important characteristic about its fundamental nature as a
measure of diversity. A measure of diversity can generally be considered either a
measure of concentration or a measure of dispersion. A measure of concentration
can be viewed as the opposite of a measure of dispersion, and we can convert
one to the other via simple transformations. For example, if $g$ corresponds to
a measure of dispersion, then we can convert it to a measure of concentration
$f$, where $f = \max(g) - g$. Here we only consider measures of concentration. A
measure was considered to be a measure of concentration if it satisfied P1 and P2
without transformation. A measure was considered to be a measure of dispersion
if it satisfied P1 and P2 following transformation. All measures of dispersion
were transformed into measures of concentration prior to our analysis.
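As a small illustration of the transformation $f = \max(g) - g$, the sketch below converts Shannon entropy, a measure of dispersion, into a measure of concentration. It assumes that $\max(g)$ is the entropy of the uniform distribution over $m$ tuples, $\log_2 m$; the function names are hypothetical and the choice of Shannon entropy as the example is ours.

```python
import math


def shannon(counts):
    """Shannon entropy in bits: a dispersion measure, largest when counts are equal."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts)


def shannon_concentration(counts):
    """Concentration form f = max(g) - g, taking max(g) = log2(m) (an assumption)."""
    g_max = math.log2(len(counts))
    return g_max - shannon(counts)


# Uniform counts give (numerically) zero, in the spirit of P1, and the index
# grows as the counts become more uneven, in the spirit of P2.
print(shannon_concentration([50, 50, 50]))
print(shannon_concentration([100, 25, 25]))
print(shannon_concentration([148, 1, 1]))
```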

Skewness Principle (P3). Given a vector $(n_1, \ldots, n_m)$, where
$n_1 = N - m + 1$, $n_i = 1$, $i = 2, \ldots, m$, and $N > m$, and a vector
$(n_1 - c, n_2, \ldots, n_m, n_{m+1}, \ldots, n_{m+c})$, where $n_1 - c > 1$ and
$n_i = 1$, $i = 2, \ldots, m + c$,

$$f(n_1, \ldots, n_m) > f(n_1 - c, n_2, \ldots, n_m, n_{m+1}, \ldots, n_{m+c}).$$

P3 specifies that a summary containing $m$ tuples, whose counts are
distributed as unevenly as possible, will be more interesting than a summary
containing $m + c$ tuples, whose counts are also distributed as unevenly as
possible. For example, given the vectors (999, 1) and (997, 1, 1, 1) (i.e.,
$c = 2$), we require that $f(999, 1) > f(997, 1, 1, 1)$.

Permutation Invariance Principle (P4). Given a vector $(n_1, \ldots, n_m)$ and
any permutation $(i_1, \ldots, i_m)$ of $(1, \ldots, m)$,
$f(n_1, \ldots, n_m) = f(n_{i_1}, \ldots, n_{i_m})$.

P4 specifies that every permutation of a given distribution of tuple counts
should be equally interesting. That is, interestingness is not a labeled
property; it is determined only by the distribution of the counts. For example,
given the vector (2, 4, 6), we require that
$f(2, 4, 6) = f(4, 2, 6) = f(4, 6, 2) = f(2, 6, 4) = f(6, 2, 4) = f(6, 4, 2)$.

Transfer Principle (P5). Given a vector $(n_1, \ldots, n_m)$ and $0 < c < n_j$,

$$f(n_1, \ldots, n_i + c, \ldots, n_j - c, \ldots, n_m) > f(n_1, \ldots, n_i, \ldots, n_j, \ldots, n_m).$$

P5, adapted from prior work, specifies that when a strictly positive transfer is
made from the count of one tuple to another tuple whose count is greater, then
interestingness increases. For example, given the vectors (10, 7, 5, 4) and
(10, 9, 5, 2), we require that $f(10, 9, 5, 2) > f(10, 7, 5, 4)$.
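The worked examples given for P3, P4, and P5 can be spot-checked numerically. The sketch below does so for two candidate measures, using the usual forms of the Simpson index (sum of squared probabilities) and the Berger index (largest probability); these forms and all function names are assumptions, and passing a spot check on one example is of course not a proof that a principle holds in general.

```python
from itertools import permutations


def simpson(counts):
    """Sum of squared probabilities (assumed form of the Simpson index)."""
    total = sum(counts)
    return sum((n / total) ** 2 for n in counts)


def berger(counts):
    """Largest probability (assumed form of the Berger index)."""
    return max(counts) / sum(counts)


def check_p3(f, total, m, c):
    """P3 example: the most skewed m-tuple vector must beat the (m + c)-tuple one."""
    most_skewed = [total - m + 1] + [1] * (m - 1)
    spread = [total - m + 1 - c] + [1] * (m - 1 + c)
    return f(most_skewed) > f(spread)


def check_p4(f, counts):
    """P4 example: every permutation must give the same index value."""
    base = f(list(counts))
    return all(abs(f(list(p)) - base) < 1e-12 for p in permutations(counts))


def check_p5(f, counts, i, j, c):
    """P5 example: moving c from position j to the larger position i must raise f."""
    moved = list(counts)
    moved[i] += c
    moved[j] -= c
    return f(moved) > f(list(counts))


for f in (simpson, berger):
    print(f.__name__,
          check_p3(f, 1000, 2, 2),             # (999, 1) vs (997, 1, 1, 1)
          check_p4(f, (2, 4, 6)),
          check_p5(f, (10, 7, 5, 4), 1, 3, 2))  # (10, 7, 5, 4) -> (10, 9, 5, 2)
```

On these examples the Simpson form passes all three checks, while the Berger form fails the P5 check because the transfer leaves the largest count unchanged.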

Those measures satisfying the above principles of interestingness are shown
in Table 2. In Table 2, the P1 to P5 columns describe the five principles, and a
measure that satisfies a principle is indicated by the bullet symbol (i.e., •).



Table 2. Measures satisfying the five principles

Measure        P1   P2   P3   P4   P5
I_Variance     •    •    •    •    •
I_Simpson      •    •    •    •    •
I_Shannon      •    •    •    •    •
I_McIntosh     •    •    •    •    •
I_Lorenz       •    •              •
I_Gini         •    •         •    •
I_Berger       •    •    •    •
I_Schutz       •    •         •
I_Bray         •    •         •
I_Whittaker    •    •         •
I_MacArthur    •    •         •    •
I_Theil        •              •
I_Atkinson     •    •         •    •



5 Experimental Results 

We now analyze the distribution of the index values generated by each of the
thirteen measures. Input data consists of two populations of vectors shown in
Table 3, where index values for 16,928 vectors (i.e., all possible ordered
arrangements of a population of 50 objects among 10 classes) and 2,611 vectors
(i.e., all possible ordered arrangements of a population of 50 objects among 5
classes) were generated. The choice of vectors to evaluate here was made
somewhat arbitrarily, but it does provide a large, controlled population of
index values in which a gradual change in evenness occurs from the most highly
skewed distribution in the first vector, to the uniform distribution in the last
vector.
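One way to generate such a population is sketched below, under the assumption that an "ordered arrangement" is a vector of strictly positive counts in descending order that sums to the population size. The generator name is hypothetical. Under this reading, the five-class population can be checked against the 2,611 vectors reported above, and the vectors run from the most skewed to the uniform one.

```python
def arrangements(total, classes, largest=None):
    """Yield descending vectors of `classes` positive integers summing to `total`."""
    if largest is None:
        largest = total
    if classes == 1:
        # The last count must respect the descending order.
        if 1 <= total <= largest:
            yield (total,)
        return
    # The first count is at least the ceiling of the average, and must leave
    # at least one object for each of the remaining classes.
    low = -(-total // classes)
    high = min(largest, total - (classes - 1))
    for first in range(high, low - 1, -1):
        for rest in arrangements(total - first, classes - 1, first):
            yield (first,) + rest


five_classes = list(arrangements(50, 5))
print(len(five_classes))   # size of the 50-objects / 5-classes population
print(five_classes[0])     # most highly skewed vector
print(five_classes[-1])    # uniform vector
```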



Table 3. Ordered arrangements of two populations 




Histograms of the absolute frequencies of the index values for the vectors in
Table 3 were generated for each measure. Again, due to space limitations, we
cannot show all of these histograms. However, sample histograms of the index
values generated for the population of 50 objects among 10 classes by I_Variance
and I_Schutz are shown in Figures 3 and 4, respectively. In Figures 3 and 4, the
horizontal and vertical axes describe intervals for the index values generated
and the number of index values that fall in each interval, respectively. For
example, the histogram for I_Variance shows that 68 index values were generated
on the interval (0.000, 0.0009], 1,106 on (0.0009, 0.003], 2,464 on
(0.003, 0.005], 3,006 on (0.005, 0.007], 2,581 on (0.007, 0.008], 2,055 on
(0.008, 0.010], 1,549 on (0.010, 0.012], and 4,099 on the remaining intervals in
(0.012, 0.065]. A curve describing the standard normal distribution (SND) of the
index values is superimposed over the observed frequencies.

To provide a summary description of each histogram, we can use the skewness
and kurtosis for the distribution of index values. Skewness is a measure of the
symmetry of a distribution. It has a value of zero when the distribution is a
symmetrical curve (i.e., as in an SND). If the skewness is different from zero,
then the distribution is asymmetrical. A positive (negative) value indicates the
index values are clustered more to the left (right) of the mean, with most of
the extreme index values to the right (left) of the mean. In general, for positive
(negative) skewness, we have mode < median < mean (mean < median < mode).
Kurtosis is a measure of the relative peakedness of a distribution and indicates
the extent to which outliers cause the distribution to differ from the SND. When
a distribution follows the SND, its kurtosis is zero. When the value is greater
than (less than) zero, the distribution has a sharper (flatter) peak than the SND
and is more (less) prone to containing outliers.
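A minimal sketch of these two summary statistics follows. The text does not say which estimator was used, so the moment-based (population) forms below are an assumption; kurtosis is reported as excess kurtosis, so that a normal distribution scores zero as described above. The function name is hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments (assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# A symmetric sample has zero skewness; a long right tail gives positive skewness.
print(skewness_and_kurtosis([1, 2, 3, 4, 5]))
print(skewness_and_kurtosis([1, 1, 1, 1, 10]))
```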

Fig. 3. Histogram for I_Variance

Fig. 4. Histogram for I_Schutz

The skewness and kurtosis for all measures are shown in Table 4. In Table 4,
mnemonics are provided as an aid to interpreting the curves described by the
values. The skewness mnemonics describe the symmetry of the frequency
distribution in relation to the mean (i.e., AL = asymmetrical left, AR =
asymmetrical right, NS = near symmetrical, and S = symmetrical) and the kurtosis
mnemonics describe the relative peakedness of the frequency distribution in
relation to the SND (i.e., SP = sharp peaked, NSN = near standard normal, MP =
more peaked, and LP = less peaked). For example, the histogram for I_Variance,
shown in Figure 3, has a skewness and kurtosis of approximately 1.8 and 5.6,
respectively. This means that the distribution of index values is asymmetrical
to the left of the mean (i.e., AL) and more sharply peaked than the SND (i.e.,
SP). Similarly, in the histogram for I_Schutz, shown in Figure 4, the
distribution of index values is near symmetrical (i.e., NS) and less peaked than
the SND (i.e., LP). The other measures in Table 4 can be interpreted similarly.

Table 4. Skewness and kurtosis of the index values for the two populations

               50 objects / 10 classes        50 objects / 5 classes
Measure        Skewness       Kurtosis        Skewness       Kurtosis
I_Variance     1.84421   AL   5.571732  SP    1.55959   AL   3.273237  SP
I_Simpson      1.84421   AL   5.571732  SP    1.55959   AL   3.273237  SP
I_Shannon      -0.95761  AR   1.357844  MP    -1.03452  AR   1.391038  MP
I_McIntosh     -1.24351  AR   2.317341  SP    -1.13072  AR   1.496420  SP
I_Lorenz       0.14435   S    -0.232495 NSN   0.02128   S    -0.317871 NSN
I_Gini         -0.14435  S    -0.232495 NSN   -0.02128  S    -0.317871 NSN
I_Berger       0.97607   AL   1.139526  SP    0.75039   AL   0.264196  SP
I_Schutz       0.13192   NS   -0.130277 LP    0.27521   NS   -0.076436 LP
I_Bray         -0.13192  NS   -0.130277 LP    -0.27521  NS   -0.076436 LP
I_Whittaker    -0.13192  NS   -0.130277 LP    -0.27521  NS   -0.076436 LP
I_MacArthur    0.68369   AL   0.485805  MP    0.86586   AL   0.883313  MP
I_Theil        -0.05563  S    -0.236451 NSN   0.68371   AL   1.112360  MP
I_Atkinson     0.16650   NS   -0.422023 LP    0.30949   AL   -0.476633 LP

We now determine the number of index values generated by each measure
that are less than and greater than the middle index value (i.e., (minimum +
maximum)/2), and less than and greater than the median (i.e., the value for
which 50% of the generated index values lie below and 50% lie above). Our belief
is that a useful measure of interestingness should generate index values that are
reasonably distributed throughout the range of possible values (such as in an
SND). Again, we analyze the index values generated from the two populations
shown in Table 3, with the results shown in Tables 5 and 6. In Tables 5 and 6, the
Minimum and Maximum columns describe the minimum and maximum index
values generated by each measure, respectively, the Middle column describes the
middle index value, the < Middle and > Middle columns describe the number
of index values less than and greater than the middle index value, respectively,
and the Median column describes the median index value. For example, for the
I_Variance measure in Table 5, the minimum and maximum index values are 0.0 and
0.064, respectively, the middle index value is 0.032, 16,761 (167) index values
lie below (above) the middle index value, and the median index value is 0.00791.
The distribution of index values in Tables 5 and 6 is highly skewed about the
middle and median values for most of the measures. Isolated exceptions include
I_Bray and I_Whittaker in Table 5 and I_Lorenz and I_Gini in Table 6.
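The columns of Tables 5 and 6 can be derived from a list of index values as sketched below. The function and key names are hypothetical, and the sample values are illustrative only, not data from the experiments.

```python
import statistics


def distribution_summary(index_values):
    """Summarise index values in the manner of the columns of Tables 5 and 6."""
    minimum = min(index_values)
    maximum = max(index_values)
    middle = (minimum + maximum) / 2
    return {
        "minimum": minimum,
        "maximum": maximum,
        "middle": middle,
        "below_middle": sum(1 for v in index_values if v < middle),
        "above_middle": sum(1 for v in index_values if v > middle),
        "median": statistics.median(index_values),
    }


# Illustrative values: most lie far below the middle of the range, so the
# median falls well under the middle index value.
print(distribution_summary([0.0, 0.1, 0.1, 0.2, 1.0]))
```

When the counts below and above the middle value are very different, as for most measures in the two tables, the index values are crowded into one end of the available range.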



Table 5. Distribution of index values for 50 objects among 10 classes

Measure        Minimum    Maximum    Middle     < Middle   > Middle   Median
I_Variance     0.0        0.064      0.032      16761      167        0.007911
I_Simpson      0.1        0.676      0.388      16761      167        0.1712
I_Shannon      1.250664   3.321928   2.286295   613        16315      2.860161
I_McIntosh     0.207096   0.7964     0.50175    509        16419      0.682799
I_Lorenz       0.214      0.55       0.37       12353      4575       0.338
I_Gini         0.107      0.275      0.185      4786       12142      0.169
I_Berger       0.14       0.82       0.46       15836      1092       0.28
I_Schutz       0.0        0.72       0.36       10751      6177       0.34
I_Bray         0.28       1.0        0.64       7549       9379       0.66
I_Whittaker    0.28       1.0        0.64       7549       9379       0.66
I_MacArthur    0.0        0.420842   0.21042    15683      1245       0.114606
I_Theil        0.0        2.141432   1.07072    5550       11378      1.21593
I_Atkinson     0.0        0.71       0.35503    11432      5496       0.296977

Table 6. Distribution of index values for 50 objects among 5 classes

Measure        Minimum    Maximum    Middle     < Middle   > Middle   Median
I_Variance     0.0        0.162      0.081      2507       104        0.0258
I_Simpson      0.2        0.848      0.524      2507       104        0.3032
I_Shannon      0.562179   2.321928   1.44205    164        2447       1.940238
I_McIntosh     0.092165   0.643839   0.36800    200        2411       0.523381
I_Lorenz       0.24       0.6        0.42       1496       1115       0.412
I_Gini         0.12       0.300      0.21       1183       1428       0.206
I_Berger       0.2        0.92       0.56       2180       431        0.42
I_Schutz       0.0        0.72       0.36       1850       761        0.3
I_Bray         0.28       1.0        0.64       939        1672       0.7
I_Whittaker    0.28       1.0        0.64       939        1672       0.7
I_MacArthur    0.0        0.427524   0.213765   2425       186        0.099571
I_Theil        0.0        1.759749   0.879875   2357       254        0.566115
I_Atkinson     0.0        0.784944   0.39247    1964       647        0.283374



6 Conclusion and Future Research 

The use of diversity measures for ranking the interestingness of summaries
generated from databases is a new application area. Here we theoretically and
experimentally analyzed thirteen diversity measures. Five principles of
interestingness for useful diversity measures were described. Theoretical
results showed that only four of the thirteen diversity measures satisfied all
five principles. Experimental results showed that the distribution of index
values, in relation to the mean, is least skewed for I_Lorenz, I_Schutz, I_Bray,
and I_Whittaker, but these measures are poorly behaved, containing a sharp peak,
or multiple sharp peaks, in the frequency distribution of the index values. The
remaining measures were skewed asymmetrically in relation to the mean, and more
or less peaked than the SND. The experimental results also show that the
distribution of the index values is highly skewed, in relation to the middle and
median values, for most of the measures.

Future research will focus on extending the theory of interestingness for
diversity measures used to rank summaries. New principles will be developed for
ranking the interestingness of summaries generated from different sources (i.e.,
related, but physically, logically, or temporally independent databases).



Peculiarity Oriented Mining and Its Application
for Knowledge Discovery in Amino-Acid Data

Abstract. The paper proposes a way of peculiarity oriented mining
and its application for knowledge discovery in the amino-acid data set.
We introduce the peculiarity rules as a new type of association rules,
which can be discovered from a relatively small number of peculiar data
by searching the relevance among the peculiar data. We argue that the
peculiarity rules represent a typically unexpected, interesting regularity
hidden in the amino-acid data set.



1 Introduction 

Peculiarity is a kind of interestingness. Peculiarity relationships/rules (with com- 
mon sense) may be hidden in a relatively small number of data. Generally speak- 
ing, hypotheses (knowledge) generated from databases can be divided into the 
following three types: 

— Incorrect hypotheses. 

— Useless hypotheses. 

— New, surprising, interesting hypotheses. 

The purpose of data mining is to discover new, surprising, interesting knowledge
hidden in databases. Hence, the evaluation of interestingness (including
peculiarity, surprisingness, unexpectedness, usefulness, novelty) should be done
in pre-processing and/or post-processing of the knowledge discovery process.

In the paper, we discuss a way of mining peculiarity rules from the amino-acid
data set. Section 2 introduces the peculiarity rules as a new type of association
rules, which can be discovered from a relatively small number of the peculiar
data by searching the relevance among the peculiar data. Section 3 describes a
method of finding the peculiar data/rules. Then in Section 4, we discuss a result
of mining from the amino-acid data set. We show that the peculiarity rules
represent a typically unexpected, interesting regularity hidden in the
amino-acid data set. Finally, Section 5 gives conclusions and outlines further
research directions.

2 Association Rules vs. Peculiarity Rules

Association rules are an important class of regularity hidden in transaction
databases. The intuitive meaning of such a rule is that transactions of
the database which contain X tend to contain Y. So far, three categories of
association rules, the general rule, the exception rule, and the peculiarity
rule, have been investigated.

A general rule is a description of a regularity for numerous objects and
represents the well-known fact with common sense, while an exception rule is for
a relatively small number of objects and represents exceptions to the well-known
fact. Usually, the exception rule should be associated with a general rule as a
set of rule pairs. For example, the rule “using a seat belt is risky for a child”
represents an exception to the common-sense general rule “using a seat belt
is safe”.

Zhong et al. proposed peculiarity rules as a new type of association rules.
A peculiarity rule is discovered from the peculiar data by searching the relevance
among the peculiar data. Roughly speaking, a datum is peculiar if it represents
a peculiar case described by a relatively small number of objects and is very
different from other objects in a data set. Although it looks like the exception
rule from the viewpoint of describing a relatively small number of objects, the
peculiarity rule represents the well-known fact with common sense, which is a
feature of the general rule.

We argue that the peculiarity rules are a typical regularity hidden in a lot
of scientific, statistical, and transaction databases. Sometimes, the general
rules that represent the well-known fact with common sense cannot be found from
numerous scientific, statistical or transaction data, or, although they can be
found, the rules may be uninteresting to the user, since in most organizations
data are rarely collected/stored in a database specifically for the purpose of
mining knowledge. Hence, the evaluation of interestingness (including
surprisingness, unexpectedness, peculiarity, usefulness, novelty) should be done
before and/or after knowledge discovery. In particular, unexpected (common sense)
relationships/rules may be hidden in a relatively small number of data. Thus, we
may focus on some interesting data (the peculiar data), and then find more
novel and interesting rules (peculiarity rules) from the data. For example, the
following rules are peculiarity rules that can be discovered from a relation
called Japan-Geography (see Table 1) in a Japan-Survey database:

rule_1: ArableLand(large) & Forest(large) → PopulationDensity(low).
rule_2: ArableLand(small) & Forest(small) → PopulationDensity(high).

In order to discover the rules, we first need to search the peculiar data in the
relation Japan-Geography. From Table 1 we can see that the values of the
attributes ArableLand and Forest for Hokkaido (i.e. 1209 Kha and 5355 Kha) and
for Tokyo and Osaka (i.e. 12 Kha and 18 Kha for ArableLand, and 80 Kha and 59 Kha
for Forest) are very different from other values in the attributes. Hence, the
values are regarded as the peculiar data. Furthermore, rule_1 and rule_2 are
generated by searching the relevance among the peculiar data. Note that we use
the qualitative representation for the quantitative values in the above rules.
The transformation of quantitative to qualitative values can be done by using
the following background knowledge on information granularity:

Basic granules:

bg_1 = {high, low}; bg_2 = {large, small};
bg_3 = {many, little}; bg_4 = {far, close};
bg_5 = {long, short};

Specific granules:

biggest-cities = {Tokyo, Osaka};
kanto-area = {Tokyo, Tiba, Saitama, ...};
kansei-area = {Osaka, Kyoto, Nara, ...};



That is, ArableLand = 1209, Forest = 5355 and PopulationDensity = 67.8 for
Hokkaido are replaced by the granules “large”, “large”, and “low”, respectively.
Furthermore, Tokyo and Osaka are regarded as a neighborhood (i.e. the biggest
cities in Japan). Hence, rule_2 is generated by using the peculiar data for both
Tokyo and Osaka as well as their granules (i.e. “small” for ArableLand and
Forest, and “high” for PopulationDensity).



Table 1. Japan-Geography

Region     Area       Population   PopulationDensity   PeasantFamilyN   ArableLand   Forest
Hokkaido   82410.58   5656         67.8                93               1209         5355
Aomori     9605.45    1506         156.8               87               169          623
Tiba       5155.64    5673         1100.3              116              148          168
Tokyo      2183.42    11610        5317.2              21               12           80
Osaka      1886.49    8549         4531.6              39               18           59





3 Peculiarity Oriented Mining 

This section describes a way of mining peculiarity rules. 

3.1 Finding the Peculiar Data 

There are many ways of finding the peculiar data. In this section, we describe 
an attribute-oriented method. 

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a data set related to an attribute in a
relation, where $n$ is the number of different values in the attribute. The
peculiarity of $x_i$ can be evaluated by the Peculiarity Factor, $PF(x_i)$,

$$PF(x_i) = \sum_{j=1}^{n} \sqrt{N(x_i, x_j)}. \quad (1)$$

It evaluates whether $x_i$ occurs in relatively small numbers and is very
different from other data $x_j$ by calculating the sum of the square roots of the
conceptual distances $N(x_i, x_j)$ between $x_i$ and $x_j$. The square root is
used in Eq. (1) because we prefer to give more weight to near distances, which
occur for a relatively large number of data, so that the peculiar data can be
found from a relatively small number of data.
The major merits of the method are:

— It can handle both the continuous and symbolic attributes based on a unified 
semantic interpretation; 

— Background knowledge represented by binary neighborhoods can be used to 
evaluate the peculiarity if such background knowledge is provided by a user. 

If $X$ is a data set of a continuous attribute and no background knowledge is
available, in Eq. (1),

$$N(x_i, x_j) = |x_i - x_j|. \quad (2)$$

Table 2 shows an example for the calculation. On the other hand, if $X$ is a data
set of a symbolic attribute and/or the background knowledge for representing
the conceptual distances between $x_i$ and $x_j$ is provided by a user, the
peculiarity factor is calculated by the conceptual distances, $N(x_i, x_j)$.
Table 3(a) shows an example in which the binary neighborhoods shown in
Table 3(b) are used as the background knowledge for representing the conceptual
distances of different types of restaurants. However, all the conceptual
distances are 1, by default, if background knowledge is not available.
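The calculation can be sketched as follows, reading the peculiarity factor as the sum over all $x_j$ of the square root of $N(x_i, x_j)$, as the description above states. The function names and the dictionary encoding of the binary neighborhoods are our assumptions. With the five ArableLand values of Table 2 this sketch gives 134.1, 60.9, 60.3, 60.5, and 59.4 after rounding, and with the restaurant data of Table 3 it gives 6.6, 4.7, and 5.5 for Wendy, the two Chinese restaurants, and Kiku (for Le Chef it gives about 7.15, against the tabulated 7.2).

```python
import math


def peculiarity_factors(values, distance):
    """PF(x_i): sum over all x_j of the square root of the distance N(x_i, x_j)."""
    return [sum(math.sqrt(distance(x, y)) for y in values) for x in values]


def absolute_difference(x, y):
    """Distance for a continuous attribute without background knowledge, as in Eq. (2)."""
    return abs(x - y)


# Conceptual distances between restaurant types, copied from Table 3(b).
TYPE_DISTANCE = {
    frozenset(("Chinese", "Japanese")): 1,
    frozenset(("Chinese", "American")): 3,
    frozenset(("Chinese", "French")): 4,
    frozenset(("American", "French")): 2,
    frozenset(("American", "Japanese")): 3,
    frozenset(("French", "Japanese")): 3,
}


def type_distance(a, b):
    """Look up the conceptual distance; identical types are at distance 0."""
    return 0 if a == b else TYPE_DISTANCE[frozenset((a, b))]


arable_land = [1209, 12, 18, 162, 147]  # Hokkaido, Tokyo, Osaka, Yamaguchi, Okinawa
print([round(pf, 1) for pf in peculiarity_factors(arable_land, absolute_difference)])

restaurant_types = ["American", "French", "Chinese", "Japanese", "Chinese"]
print([round(pf, 2) for pf in peculiarity_factors(restaurant_types, type_distance)])
```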



Table 2. An example of peculiarity factors for a continuous attribute

Region      ArableLand   PF
Hokkaido    1209         134.1
Tokyo       12           60.9
Osaka       18           60.3
Yamaguchi   162          60.5
Okinawa     147          59.4



After the evaluation of the peculiarity, the peculiar data are extracted by
using a threshold value,

$$\text{threshold} = \text{mean of } PF(x_i) + \alpha \times \text{standard deviation of } PF(x_i), \quad (3)$$

where $\alpha$ can be adjusted by a user, and $\alpha = 1$ by default. That is,
if $PF(x_i)$ is over the threshold value, $x_i$ is a peculiar datum.

Table 3. An example of peculiarity factors for a symbolic attribute (Restaurants) and
their conceptual distances

(a) Peculiarity factors

Restaurant   Type       PF
Wendy        American   6.6
Le Chef      French     7.2
Great Wall   Chinese    4.7
Kiku         Japanese   5.5
South Sea    Chinese    4.7

(b) Conceptual distances

Type       Type       N
Chinese    Japanese   1
Chinese    American   3
Chinese    French     4
American   French     2
American   Japanese   3
French     Japanese   3

Based on the preparation stated above, the process of finding the peculiar
data can be outlined as follows:



Step 1. Calculate the peculiarity factor $PF(x_i)$ in Eq. (1) for all values in a
data set (i.e. an attribute).

Step 2. Calculate the threshold value in Eq. (3) based on the peculiarity factor
obtained in Step 1.

Step 3. Select the data that is over the threshold value as the peculiar data.

Step 4. If the current peculiarity level is enough, then go to Step 6.

Step 5. Remove the peculiar data from the data set and thus, we get a new
data set. Then go back to Step 1.

Step 6. Change the granularity of the peculiar data by using background
knowledge on information granularity if the background knowledge is available.

Furthermore, the process can be done in a parallel-distributed mode for multiple 
attributes, relations and databases since this is an attribute-oriented finding 
method. 
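
As an illustration, Steps 1 to 3 can be sketched in Python on the ArableLand values of Table 2 (with the Osaka entry read as 18). The form of the peculiarity factor used here, $PF(x_i) = \sum_j |x_i - x_j|^{\alpha}$ with $\alpha = 0.5$, is an assumption, since its definition lies outside this passage; it is chosen because it reproduces the PF column of Table 2. The use of the population standard deviation and the function names are likewise assumptions of this sketch, and only a single pass is shown (Steps 4 to 6 are omitted).

```python
from statistics import mean, pstdev


def peculiarity_factors(values, alpha=0.5):
    """Assumed form: PF(x_i) = sum_j |x_i - x_j| ** alpha."""
    return [sum(abs(x - y) ** alpha for y in values) for x in values]


def peculiar_indices(pf, a=1.0):
    """Indices whose PF exceeds mean(PF) + a * std(PF), following Eq. (3).

    The population standard deviation is an assumption of this sketch.
    """
    threshold = mean(pf) + a * pstdev(pf)
    return [i for i, v in enumerate(pf) if v > threshold]


regions = ["Hokkaido", "Tokyo", "Osaka", "Yamaguchi", "Okinawa"]
arable_land = [1209, 12, 18, 162, 147]

pf = peculiarity_factors(arable_land)
peculiar = [regions[i] for i in peculiar_indices(pf)]
print([round(v, 1) for v in pf], peculiar)
```

With these inputs only Hokkaido exceeds the threshold, which agrees with its PF value standing far apart from the others in Table 2.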

3.2 Relevance among the Peculiar Data 

A peculiarity rule is discovered from the peculiar data by searching the relevance 
among the peculiar data. Let X(x) and Y(y) be the peculiar data found in two 
attributes X and Y respectively. We deal with the following two cases: 

- If both X(x) and Y(y) are symbolic data, the relevance between X(x) and 
Y(y) is evaluated in the following equation: 

$$R_1 = P_1(X(x) \mid Y(y)) \, P_2(Y(y) \mid X(x)). \qquad (4)$$

That is, the larger the product of the probabilities $P_1$ and $P_2$, the stronger 
the relevance between X(x) and Y(y). 

- If both X(x) and Y(y) are continuous attributes, the relevance between 
X(x) and Y(y) is evaluated by using the method developed in our KOSI 
system. 

Furthermore, Eq. (4) is suitable for handling more than two peculiar data found 
in more than two attributes if X(x) (or Y(y)) is a granule of the peculiar data. 
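
A minimal sketch of the relevance measure for the symbolic case follows. It assumes that the two conditional probabilities are estimated as relative frequencies over the tuples in which both attributes are observed; the function name and the sample tuples are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def relevance(pairs, x, y):
    """R1 = P(X=x | Y=y) * P(Y=y | X=x), estimated from co-occurrence counts."""
    joint = Counter(pairs)
    n_x = sum(c for (a, _), c in joint.items() if a == x)
    n_y = sum(c for (_, b), c in joint.items() if b == y)
    if n_x == 0 or n_y == 0:
        return 0.0
    n_xy = joint[(x, y)]
    return (n_xy / n_y) * (n_xy / n_x)


# Hypothetical tuples: (value of attribute X, value of attribute Y).
rows = [("a", "low"), ("a", "low"), ("b", "low"), ("b", "high"), ("c", "high")]
print(relevance(rows, "a", "low"))
```

Here P(a | low) = 2/3 and P(low | a) = 1, so the relevance is 2/3; a pair that never co-occurs has relevance 0.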



4 Application in Amino-Acid Data Mining 

Several databases, such as Japan-survey, web-log, weather, supermarket, and 
amino-acid data, have been or are being tested with our approach. This 
section discusses a result of mining the amino-acid data set. 

The amino-acid data set can be divided into two groups: the amino-acid matrix 
(including the VH and VL amino-acid matrices) and experimental data (including 
combining coefficients and thermodynamics-related coefficients). The main 
features of the data set can be summarized as follows: 

- The number of attributes is large: 230 for the amino-acid matrix and 7 for 
the experimental data. 

- The number of instances is relatively small, and only a small number of 
values in the amino-acid matrix change. 

The objective of data mining is to find the association between the amino-acid 
matrix and the experimental data, that is, how the experimental data change when 
the amino-acid data are changed. 

First, we find the peculiar data in each attribute by using the 
method stated in Section 3.1. As a result, the data shown in bold 
in Tables 4, 5 and 6 are the peculiar data. Note that in Tables 4, 5 and 6, the 
last tuple T is the threshold calculated by Eq. (3) with $\alpha = 1$. 

From tuple 23 (i.e., No 23) in Tables 4, 5 and 6, we can see that the 
value 42 in the attribute Ka (combining coefficients) is a peculiar data and the 
maximum one in Ka, while there is no change in the amino-acid matrix. Therefore, we 
focus on the attribute Ka and search for the minimum value in Ka. In other words, 
we want to find how the thermodynamics-related coefficients and the amino-acid 
matrix change when the combining coefficients change greatly. We found that 

- the value 0.04 in the attribute Ka (tuple 26, i.e., No 26) is the minimum 
one; 

- in the same tuple (tuple 26), the thermodynamics-related values -32.6 in DG, 
-53.4 in DH, and -0.92 in DCp are peculiar data; 

- the value a in 32 of the VL amino-acid matrix is also a peculiar one. 

Furthermore, we found that there is a functional relationship between Ka and 
DG. Therefore, we just use one attribute, Ka or DG, when generating a 
peculiarity rule. 

In summary, the discovered rules are 

If the value in 32 of VL amino-acid matrix is changed to a, 

Then the value of Ka is the minimum one and the values of DH and 
DCp are peculiar ones. 



or 



If the value of Ka is the minimum one and the values of DH and DCp 

are peculiar ones, 

Then the value in 32 of VL amino-acid matrix is changed to a. 

The result has been evaluated by an expert. According to his opinion, 
the discovered rules are reasonable and interesting. 

We argue that the peculiarity rules represent a typically unexpected, interesting 
regularity hidden in the amino-acid data set. The rules are peculiar ones 
rather than exceptions because of semantic common sense. 



Table 4. VH amino-acid matrix with peculiar data and their PF values 
Table 5. VL amino-acid matrix with peculiar data and their PF values 

T   11.26   11.26   11.26   11.26   7.44   11.26   7.44



Table 6. Experimental data and their PF values 

No   Ka ×10^7/M^-1   PF(x)    DG/Jmol^-1   PF(x)    DH/Jmol^-1   PF(x)    TDS/Jmol^-1   PF(x)    DCp/Jmol^-1K^-1   PF(x)
1    9.6             74.38    -46.3        44.2     -97.9        119.79   -51.6         113.6    -2.25             19.67
2    10              74.85    -46.4        43.95    -112.9       145.52   -66.5         142.47   -2.15             19.71
3    16.9            92.03    -47.7        49.28    -108.7       136.72   -61           129.63   -2.26             20.02
4    22              108.06   -48.5        54.07    -115.8       154.33   -67.3         144.9    -2.25             19.67
5    ND              0        ND           0        ND           0        ND            0        ND                0
6    1.5             83.08    -41.4        63.93    -67.3        138.89   -25.9         123.67   -1.8              18.15
7    7.1             75.62    -45.6        46.3     -73.2        131.2    -27.6         121.36   -1.81             18.19
8    2.3             80.84    -42.6        58.04    -65.6        141.15   -23           128.26   -1.78             18.19
9    ND              0        ND           0        ND           0        ND            0        ND                0
10   0.37            86.69    -38          80.07    -53.9        163.18   -15.9         145.42   -1.38             18.4
11   2.9             79.65    -43.1        55.11    -59.8        150.34   -16.7         143.06   -1.27             20.37
12   0.19            87.62    -36.4        87.81    -60.2        149.47   -23.8         126.77   ND                0
13   11              76.25    -46.4        43.95    -81.9        123      -35.4         114.99   -1.43             18.35
14   15              85.34    -47.2        45.93    -84.4        119.01   -37.2         113.34   -0.98             23.6
15   13              80.01    -46.8        44.58    -98.6        120.69   -51.8         113.9    -1.38             18.4
16   12              77.05    -47          45.42    -90.5        118.37   -42.5         112.02   -1.4              18.07
17   2.6             80.08    -43.1        55.11    -72.3        132.28   -29.2         119.57   -1                23.32
18   3.4             79.3     -43.5        54.23    -62.3        146.21   -18.8         137.83   -0.92             24.47
19   23              111.74   -48.5        54.07    -85.3        119.49   -36.8         113.47   -1.68             18.19
20   ND              0        ND           0        ND           0        ND            0        ND                0
21   21              105.05   -48.2        52.46    -113.7       147.16   -65.5         140.19   -1.9              18.69
22   4               79.16    -44.1        52.63    -114.2       148.64   -70.1         154.37   -2.1              19.56
23   42              169.11   -50.2        67.3     -91.5        117.98   -41.3         112.39   -1.4              18.07
24   34              147.06   -49.5        62.16    -106.3       130.92   -56.8         122.24   -2.42             23.28
25   8.8             74.45    -46.1        44.36    -105.8       129.88   -59.7         126.2    -2.31             21.13
26   0.04            88.67    -32.6        105.24   -53.4        164.42   -20.8         133.17   -0.92             24.47
27   0.97            84.53    -40.5        68.11    -53.6        163.73   -13.1         154.83   -1.59             18.17
28   0.64            85.6     -39.4        73.35    -84.4        119.01   -45           111.82   -1.02             23.15
29   9.2             74.25    -46.1        44.36    -76          129.01   -30.1         119.02   -1.64             18.26
30   7.8             75.17    -45.7        45.96    -96.8        119.11   -51.1         113.54   -2.25             19.67
31   6.9             75.93    -45.5        46.87    -105         128.96   -59.5         125.84   -2.39             22.67
32   15.6            87.28    -47.5        48       -101.6       125.09   -54.1         118.04   -2.24             19.81
33   14              82.61    -47.2        45.93    -93.6        117.85   -46.4         111.9    -1.4              18.07
34   ND              0        ND           0        ND           0        ND            0        ND                0
35   12              77.05    -46.8        44.58    -94.6        118.08   -47.8         112.52   -1.97             19.04
T                    112.48                71.65                 164.16                 153.35                     24.44
5 Conclusion 

We presented a way of peculiarity oriented mining and its application to knowledge 
discovery in the amino-acid data set. The peculiarity rules are defined as a 
new type of association rules that is a kind of regularity hidden in a relatively 
small number of peculiar data. We showed that the peculiarity rules represent a 
typically unexpected, interesting regularity hidden in the amino-acid data set. 

Since this project is very new, we have only a preliminary result in knowledge 
discovery from the amino-acid data set. Our future work includes using 
more domain knowledge in the knowledge discovery process, mining in multiple 
information sources, and developing an agent-based mining system. 

Acknowledgements. The authors would like to thank Prof. S. Tsumoto and 
Prof. K. Tsumoto for providing the amino-acid data set and background knowl- 
edge, and evaluating the experimental results. 
Abstract. This paper presents a sequence pattern mining technique to mine data 
generated from a wind tunnel experiment. The goal is to discover the nonlinear 
input-output relationship for a delta wing aircraft. In contrast to categorical 
datasets, the output variable(s) in this dataset is continuous and takes distinct 
values, which is common in physical experiments. Directly applying existing 
decision tree or rule induction mining methods fails to discover sufficient 
knowledge. Therefore, we propose to extend current techniques by 
constructing sequence patterns that represent the output variations in certain 
ranges of selective inputs. Similar sequence patterns are clustered based on a 
weighted variance measure. Rules can then be derived from similar sequences 
to predict the output. This technique has been applied to the experimental data 
and generates rules useful for flight control. 



1 Introduction 

Existing data mining methods such as decision tree induction [11], rule derivation [1] 
or Bayesian learning [3] have largely focused on datasets with nonnumeric or 
categorical variables. Therefore, these methods are suitable for such applications as 
product forecasting or cross-selling where categorical variables prevail. However, 
data generated from scientific experiments are different from conventional datasets in 
the following aspects: 

• Numerical variables involved are continuous and may take distinct real numbers 
within valid ranges. 

• Strong causal relationships exist among these numerical variables. The outcome 
of one output variable is often correlated with all the input variables. 

Therefore, new approaches are required to discover knowledge from these 
experimental data. 

In this paper, we focus on a dataset generated by the MEMS UAV 
(Uninhabited Aerial Vehicle) project in the Mechanical and Aerospace Engineering 
Department at UCLA. The data are collected from a wind tunnel in which a delta wing 
aircraft model is mounted [7]. Each tuple correlates one particular input configuration 
of the aircraft model with the corresponding force loading outputs. The goal of 
mining this dataset is to derive the highly nonlinear input-output relationship for the 
aircraft. Such knowledge will be useful for flight control. The preliminary dataset 
contains 192 tuples summarized from the wind tunnel experiments and provides 
insights into aircraft maneuvering via MEMS devices. 

Traditional mining methods generate knowledge to predict output variables based on 
a subset of input variables. This approach is acceptable in many business-related 
applications where a portion of the inputs is sufficient to predict the output. In 
physical system control such as the delta wing aircraft with MEMS actuators, 
however, output variables (e.g., forces and moments) are highly dependent on all input 
variables (angle of attack, stream velocity, actuation position; see Figure 1 and Figure 
2). Under such environments, existing algorithms are unable to derive an input-output 
relationship that covers all the cases. 

To remedy this problem, we transform the original dataset by merging the output 
with several inputs into a composite output variable called sequence. More precisely, 
a sequence is defined as the output variation in a certain range of selected inputs. The 
transformed results enable us to cluster similar sequences via a bottom-up algorithm. 
Existing methods, e.g. rule induction, can then be applied to these sequence clusters. 
Using such an approach, we are able to derive a fairly complete input-output 
relationship for the wind tunnel experimental data. 

Scientific discovery research has existed for more than a decade. Its goal is to 
find knowledge that is novel, interesting, plausible, and understandable [14]. From 
this general perspective, scientific discovery shares common characteristics with that 
of knowledge discovery (data mining) in business applications. This work is strongly 
influenced by the scientific discovery viewpoint and yet leveraged on the existing data 
mining techniques in discovering interesting patterns from a scientific dataset. The 
resulting rules are special cases of the qualitative and quantitative laws in the general 
scientific discovery framework [8]. 

The rest of the paper is organized as follows. Section 2 gives a brief background 
on the aircraft control principles and shows the deficiencies of directly applying 
traditional methods. In Section 3, we propose the sequence clustering technique and 
apply it to the wind tunnel experimental dataset. Section 4 concludes the paper and 
provides future research directions. 



2 Control of a Delta Wing Aircraft 

MEMS UAV (Uninhabited Aerial Vehicle), an ongoing project at UCLA, has 
demonstrated the possibility of using MEMS micron-scale actuation devices to 
control macro-scale machines, e.g., an aircraft. Such a design has numerous 
advantages in reducing weight, overall power consumption and radar cross-section. 
The project uses the “vortex” control method to provide forces and moments for 
controlling the aircraft. Typically, a delta wing aircraft will produce pairs of vortices 
above the wings (Figure 1). These vortices are sources of low-pressure flows that 
provide “suction”, which produces a portion of lift for the aircraft. Airflow blows 
toward the delta wing, first hitting the lower surface and then moving up toward the 
upper surface, eventually detaching near the leading edge and creating the vortex 
pair. Numerous studies have shown that the genesis location of these vortices, i.e., 
the detaching positions, is very important to the characteristics of the resulting large 
primary vortices [9, 10]. By placing MEMS actuators near this location, the symmetry 
of these high “suction” vortices is broken. As a result, aerodynamic loadings on the 
aircraft can be controlled. 



2.1 Problem Description 

The key for aircraft control is accurately predicting the aircraft’s force loadings based 
on certain environment settings and an actuation position. Load measurements for the 
delta wing, with a six-component force balance, are divided into two categories: 
forces and moments. Each category has three variables, corresponding to the three 
dimensions. Environment settings include the wind tunnel stream velocity and the 
aircraft’s angle of attack. Figure 2 visually interprets these terms. As shown in the 
figure, the delta wing is equipped with rounded leading edges. Note that the actuation 
position is one point at each cross section, forming a straight line along the whole 
leading edge. This position is represented by an angle value, ranging from 0° to 180°. 




Fig. 2. Input variables for a delta wing aircraft. (a) Environment settings: angle of attack and stream velocity. (b) The actuation angle affects the force and moment output of the delta wing aircraft. 

In the rest of the paper, the variables about the environment settings and the 
actuation angle are referred to as ‘input variables’, whereas the variables about the 
force balance outputs as ‘output variables’ . 

Wind tunnel experimental results have shown drastic variances of the force balance 
outputs with different environment and actuation settings. Table 1 shows part of the 
dataset that relates the rolling moment output (one component in the force balance 
outputs) with corresponding input values. 



2.2 Data Characteristics 

Due to the current experiment design, the input variables (i.e., angle of attack, stream 
velocity and actuation angle) only take a small number of distinct values. Therefore, 
these inputs can be treated as categorical. In contrast, the output variable has distinct 
values for all tuples and ranges over the real numbers. 

Furthermore, the output variable is dependent on all the input variables. Knowing 
even two of the three inputs is insufficient to predict the output. For example, using 
two variable combinations like “angle of attack = 20 and stream velocity = 10” cannot 
predict the rolling moment output (Table 1). This characteristic greatly undermines 
the effectiveness of decision tree or rule induction methods, where the output is 
predicted only based on a subset of the inputs. The detailed results of decision tree 
and rule mining are shown in the appendix. 



Table 1. Wind tunnel experiment results of the delta wing aircraft. Angle of attack, stream 
velocity and actuation angle are the input variables and rolling moment output is the output 
variable 

Angle of attack (°)   Stream velocity (m/s)   Actuation angle (°)   Rolling moment output
20                    10                      40                    -0.00485
20                    10                      60                    0.00092
20                    10                      80                    0.00026
20                    10                      100                   -0.00621
20                    10                      120                   0.00011
20                    10                      140                   -0.00626
20                    15                      40                    -0.01179
20                    15                      ...                   ...
20                    15                      140                   -0.00361





To solve this problem, existing methods need to be extended for this dataset. Note 
that predicting the output based on a set of inputs is common in many physical 
systems. Therefore, the technique presented in this paper is general in nature. 



3 Discovering Rules on the Basis of Sequence Clustering 

The basic idea of our technique is as follows. The output value may not be 
determined based on a subset of the inputs. However, the output variation in certain 
ranges of selected inputs may follow certain sequence patterns. We shall first extract 
such sequence patterns from the raw data. A sequence clustering hierarchy can be 
built in a bottom-up fashion based on inter-cluster errors (ice). Such a hierarchy 
provides cluster candidates. A weighted variance (wvar) measure is used to describe 
the closeness of the sequences within each candidate. The clustering is finalized by 
selecting candidate clusters whose wvars are below a user-specified threshold. 
Sequences in such clusters are considered similar and approximated by the 
corresponding sequence mean. Rules can be then derived on each cluster to represent 
the input-output relationship. 
3.1 Definition of a Sequence 

Consider a dataset D with input variables $X_1, X_2, \ldots, X_n$, an output variable Y and a 
predicate p defined on the inputs. A sequence of Y w.r.t. $X_i$ ($1 \le i \le n$) characterized by 
p is a set of 2-tuples $\{\langle y_1, x_1 \rangle, \langle y_2, x_2 \rangle, \ldots, \langle y_m, x_m \rangle\}$ obtained from the tuples 
of D that satisfy p. Here $y_k$ and $x_k$ ($1 \le k \le m$) are specific values of Y and $X_i$, 
respectively. Without loss of generality, we can assume $x_1 < x_2 < \cdots < x_m$. For 
example, Table 1 contains a sequence of the rolling moment w.r.t. the actuation angle: 
{<-0.00485, 40>, <0.00092, 60>, <0.00026, 80>, <-0.00621, 100>, <0.00011, 120>, 
<-0.00626, 140>} characterized by “angle of attack = 20 AND stream velocity = 10”. 

Note that this definition is slightly different from those in existing research, e.g., 
[12, 13], where the time variable is implicitly used as $X_i$ in our definition. 
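
The definition can be illustrated with a short Python sketch that groups the legible rows of Table 1 by the predicate "angle of attack = a AND stream velocity = v" and orders each group by actuation angle. The tuple layout and the function name are assumptions made for this illustration only.

```python
from collections import defaultdict

# (angle of attack, stream velocity, actuation angle, rolling moment),
# taken from the legible rows of Table 1.
tuples = [
    (20, 10, 40, -0.00485), (20, 10, 60, 0.00092), (20, 10, 80, 0.00026),
    (20, 10, 100, -0.00621), (20, 10, 120, 0.00011), (20, 10, 140, -0.00626),
    (20, 15, 40, -0.01179), (20, 15, 140, -0.00361),
]


def extract_sequences(rows):
    """Group rows by (aoa, vel) and order each group by actuation angle.

    Each sequence is a list of <output, input> pairs, as in the definition.
    """
    groups = defaultdict(list)
    for aoa, vel, act, moment in rows:
        groups[(aoa, vel)].append((moment, act))
    return {key: sorted(seq, key=lambda t: t[1]) for key, seq in groups.items()}


sequences = extract_sequences(tuples)
print(sequences[(20, 10)])
```

The entry for (20, 10) is the six-element sequence quoted in the example above.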



3.2 Clustering Hierarchy Generation 



Since sequences are objects with no total order, we use a bottom-up clustering 
strategy, MDC [15], to build a hierarchical cluster over the sequence set. Given s 
sequences, s initial clusters are built, each containing one sequence. The algorithm 
merges the two closest clusters at each step and finishes constructing a binary tree after 
the (s − 1)-th iteration. 

Let us now apply this clustering strategy to the experimental data: 

1. Extract sequences of the rolling moment output w.r.t. the actuation angle (Table 1). 
Such sequences are characterized by predicates in the form “angle of attack = a 
AND stream velocity = v”, where a and v range over {5°, 10°, 15°, 20°, 25°, 30°, 
35°} and {10 m/s, 15 m/s, 20 m/s}, respectively. 

2. In order to discover more frequently occurring patterns from these sequences, we 
normalize the output variable so that, for a particular angle of attack and stream 
velocity, the difference between the maximum and the minimum output is 1. 

3. Euclidean distance is used as the distance measure between two sequences $S_i$ and $S_j$: 

$$d(S_i, S_j) = \sqrt{\sum_{k=1}^{m} \left(s_k^i - s_k^j\right)^2} \qquad (1)$$

Here m is the length of each sequence, while $s_k^i$ and $s_k^j$ ($1 \le k \le m$) are the output 
values in sequences $S_i$ and $S_j$, respectively. 

4. An inter-cluster error (ice) measure [15] is used to calculate the distance between 
two sequence clusters $C_1$ and $C_2$ ($|C|$ denotes the size of C): 

$$ice(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{i=1}^{|C_1|} \sum_{j=1}^{|C_2|} d(S_i, S_j), \quad S_i \in C_1,\ S_j \in C_2 \qquad (2)$$

The resulting clustering hierarchy is shown in Figure 3. Each leaf is a sequence 
characterized by the corresponding label. For example, the leaf “angle of attack = 20° 
AND stream velocity = 10” represents the sequence {<-0.00485, 40>, <0.00092, 60>, 
<0.00026, 80>, <-0.00621, 100>, <0.00011, 120>, <-0.00626, 140>}. 
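
The four steps above can be sketched as a small bottom-up clustering routine. This is not the MDC implementation of [15]: the inter-cluster error is taken here as the average pairwise Euclidean distance, which is an assumed reading of Eq. (2), normalization is done by dividing by the output range, and the input sequences are hypothetical values rather than wind tunnel data.

```python
import math
from itertools import combinations


def normalize(seq):
    """Scale outputs so that max - min equals 1 (step 2).

    Constant sequences are returned unchanged (an assumption of this sketch).
    """
    span = max(seq) - min(seq)
    return list(seq) if span == 0 else [v / span for v in seq]


def distance(s1, s2):
    """Euclidean distance between two equal-length sequences, Eq. (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)))


def ice(c1, c2):
    """Assumed reading of Eq. (2): average pairwise distance between clusters."""
    return sum(distance(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def build_hierarchy(seqs):
    """Merge the two closest clusters until one remains; return the merges in order."""
    clusters = [[s] for s in seqs]
    merges = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: ice(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical output sequences (not taken from the wind tunnel data).
raw = [[0.0, 2.0, 4.0], [1.0, 2.0, 3.0], [5.0, 3.0, 1.0], [0.0, 1.0, 0.0]]
merges = build_hierarchy([normalize(s) for s in raw])
print(len(merges), merges[0])
```

With s = 4 sequences the routine performs s − 1 = 3 merges, and each merge corresponds to one branch node of the resulting binary tree.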
3.3 Cluster Selection Based on Weighted Variance 

Each branch node in the generated hierarchy represents a candidate cluster. We shall 
introduce the notion of weighted variance (wvar) to measure the closeness within a 
candidate cluster. Candidates with wvars lower than a specified threshold will be 
chosen as final clusters. For a cluster $C = \{S_1, S_2, \ldots, S_l\}$, wvar(C) should be 
proportional to the cluster’s standard deviation: 

$$wvar(C) \propto \sigma(C) \qquad (3)$$

where $\sigma(C)$ is the standard deviation of $S_1, \ldots, S_l$ about their mean $\bar{S}$. 

For two clusters with the same standard deviation but different amplitude ranges, 
we introduce the amplitude measure amp(C) to provide weighted preference based on 
a cluster C’s amplitude range: 

$$amp(C) = \max_{1 \le i \le l}\Bigl(\max_{1 \le k \le m} s_k^i\Bigr) - \min_{1 \le i \le l}\Bigl(\min_{1 \le k \le m} s_k^i\Bigr) \qquad (4)$$

where m is the length of each sequence and $s_k^i$ ($1 \le i \le l$, $1 \le k \le m$) is the output 
value in sequence $S_i$. 

Therefore, we define the weighted variance for a cluster C as: 

$$wvar(C) = \frac{\sigma(C)}{amp(C)} \text{ if } amp(C) \ne 0; \text{ otherwise } 0 \qquad (5)$$

Each branch node in Figure 3 represents a candidate cluster and is labeled by that 
cluster’s wvar. All the sequences in a branch with wvar below a certain user-specified 
threshold are considered similar. A smaller wvar threshold yields smaller cluster 
sizes and a more accurate approximation by the sequence mean. By setting such a 
threshold as “wvar < 0.32”, the final clusters are chosen as shown in Figure 4. 
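
The selection step can be sketched as follows. Two points are assumptions of this sketch rather than statements of the method: the standard deviation of a cluster is taken as the root mean squared Euclidean distance of its sequences to the sequence mean, and the hierarchy is pruned top-down so that the highest branch nodes whose wvar falls below the threshold become final clusters. The tree encoding (a list is a leaf sequence, a 2-tuple is a branch node) and the sample values are hypothetical.

```python
import math


def sequence_mean(cluster):
    """Element-wise mean of the sequences in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def std(cluster):
    """Assumed form: root mean squared Euclidean distance to the sequence mean."""
    m = sequence_mean(cluster)
    sq = [sum((a - b) ** 2 for a, b in zip(s, m)) for s in cluster]
    return math.sqrt(sum(sq) / len(cluster))


def amp(cluster):
    """Amplitude range of a cluster, Eq. (4)."""
    return max(max(s) for s in cluster) - min(min(s) for s in cluster)


def wvar(cluster):
    """Weighted variance, Eq. (5)."""
    a = amp(cluster)
    return std(cluster) / a if a != 0 else 0.0


def leaves(node):
    """Sequences under a node; a list is a leaf, a 2-tuple is a branch node."""
    return [node] if isinstance(node, list) else leaves(node[0]) + leaves(node[1])


def prune(node, threshold):
    """Top-down: keep the highest nodes whose wvar is below the threshold."""
    members = leaves(node)
    if isinstance(node, list) or wvar(members) < threshold:
        return [members]
    return prune(node[0], threshold) + prune(node[1], threshold)


# Hypothetical normalized sequences arranged in a small hierarchy.
a, b, c = [0.0, 0.5, 1.0], [0.0, 0.6, 1.0], [1.0, 0.0, 1.0]
tree = ((a, b), c)
final_clusters = prune(tree, threshold=0.32)
print(final_clusters)
```

Under these assumptions a single-sequence cluster has wvar 0, which is consistent with the singleton clusters in Figure 4.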



aoa: angle of attack; vel: stream velocity 

Fig. 3. Clustering of sequences extracted from the experimental flight data 



Based on the results in Figure 4, traditional mining methods such as rule induction 
can be applied to generate more complete knowledge. 
Fig. 4. The pruned clustering hierarchy of Figure 3 with wvar < 0.32. Final clusters: 

cluster #1 (wvar 0.319604): aoa 20 vel 15, aoa 20 vel 20, aoa 25 vel 20 
cluster #2 (wvar 0): aoa 25 vel 15 
cluster #3 (wvar 0.285693): aoa 10 vel 10, aoa 10 vel 15, aoa 10 vel 20 
cluster #4 (wvar 0): aoa 5 vel 20 
cluster #5 (wvar 0.304964): aoa 5 vel 10, aoa 5 vel 15, aoa 15 vel 15, aoa 15 vel 20 
cluster #6 (wvar 0): aoa 15 vel 10 
cluster #7 (wvar 0.302655): aoa 25 vel 10, aoa 30 vel 10, aoa 30 vel 20 
cluster #8 (wvar 0.243029): aoa 35 vel 10, aoa 35 vel 15, aoa 35 vel 20 
remaining singletons (wvar 0): aoa 20 vel 10; aoa 30 vel 15 



3.4 Rule Derivation from Similar Sequences 

Based on the clustering result, forward inference rules (also referred to as 
classification or discriminant rules [5, 6]) can be derived in the following form: 

IF p THEN mean(cluster #i), wvar(cluster #i), confidence: P[cluster #i | p]    (6) 

Here p is a predicate defined on the input variables, mean(cluster #i) is the 
sequence mean calculated on cluster #i, wvar(cluster #i) is the cluster’s weighted 
variance and P[cluster #i | p] is the conditional probability of cluster #i given p. 

To derive such forward inference rules, an algorithm should search over all 
possible input variable predicates and select those predicates that yield rule supports 
and confidences above certain thresholds. Pruning strategies are used in this process 
to reduce the search space. For efficient algorithms on forward inference rule 
generation, see [4, 5]. 

For example, to generate rules on cluster #5 (Figure 4), we set the minimum 
support as “2” and minimum confidence as “60%”. The forward inference rules 
generated are: 

IF angle of attack=5° THEN mean(cluster #5), wvar 0.304964, confidence 66.7%. 
IF angle of attack=15° THEN mean(cluster #5), wvar 0.304964, confidence 66.7%. 

Figure 5(a) displays the rolling moment output values of the four sequences in 
cluster #5. Figure 5(b) shows the corresponding sequence mean. 

Similarly, the following rule can be generated from cluster #8: 

IF angle of attack=35° THEN mean(cluster #8), wvar 0.243029, confidence 100%. 

The sequences and mean of cluster #8 are shown in Figure 6(a) and Figure 6(b), 
respectively. 
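
The support and confidence figures quoted above can be checked with a short sketch. It considers only predicates on the angle of attack, counts support as the number of matching sequences, and uses cluster assignments for angles of attack 5°, 15° and 35° as read from Figure 4; the function name and the data layout are assumptions of this illustration, and the efficient search and pruning strategies of [4, 5] are not reproduced.

```python
from collections import Counter

# Cluster labels for a subset of sequences, as read from Figure 4:
# (angle of attack, stream velocity) -> cluster number.
assignment = {
    (5, 10): 5, (5, 15): 5, (5, 20): 4,
    (15, 10): 6, (15, 15): 5, (15, 20): 5,
    (35, 10): 8, (35, 15): 8, (35, 20): 8,
}


def rules_on_aoa(assignment, min_support=2, min_confidence=0.6):
    """Rules 'IF angle of attack = a THEN cluster c' with support and confidence."""
    per_aoa = Counter(aoa for aoa, _ in assignment)
    joint = Counter((aoa, c) for (aoa, _), c in assignment.items())
    rules = []
    for (aoa, c), support in sorted(joint.items()):
        confidence = support / per_aoa[aoa]
        if support >= min_support and confidence >= min_confidence:
            rules.append((aoa, c, support, round(confidence, 3)))
    return rules


for rule in rules_on_aoa(assignment):
    print(rule)
```

Two of the three sequences with an angle of attack of 5° (and likewise 15°) fall in cluster #5, which gives the 66.7% confidence, while all three sequences at 35° fall in cluster #8.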

3.5 Application of Derived Rules 

Since the clustering result summarizes the raw data, we can derive rules on all the 
clusters. The resulting rule set gives much better coverage over the entire case space, 
and therefore is more useful for flight control. 
Fig. 5. (a) Four sequences in cluster #5 (horizontal axis: actuation angle) 

Fig. 6. (a) Three sequences in cluster #8 

Currently, forward inference rules can be used to predict force balance outputs 
based on input settings. An average sequence is first predicted using angle of attack 
and/or stream velocity. For example, given an angle of attack at 35°, the sequence in 
Figure 6(b) is selected. Using this sequence, the rolling moment output can be 
determined at each actuation angle. Note that the sequences have been normalized 
before clustering (see Sect. 3.2). The output value should be multiplied by the 
corresponding normalization factor. 

We are planning a series of wind tunnel experiments to learn the dynamic 
characteristics of the delta wing aircraft. Not only the force balance outputs but also 
the aircraft’s moving dynamics (e.g. speed and acceleration) will be recorded. The 
augmented dataset will allow us to derive rules providing more insight into flight 
dynamics. For example, rules can be generated predicting the variation of the rolling 
speed with respect to the actuation setting and/or the roll angle. Such rules can guide 
us to select the proper actuation schema to achieve a desirable control effect. 



4 Conclusion and Future Work 

Traditional mining methods fail to derive a sufficient input-output relationship for 
predicting physical system behavior. In this paper, we propose a novel knowledge 
discovery technique based on sequence patterns. A sequence is defined as the output 
variation in certain ranges of selected inputs. A sequence clustering hierarchy can be 
built in a bottom-up fashion, using inter-cluster error (ice) as the distance measure. 
Based on the hierarchy, similar sequences are grouped into clusters. The sizes of 
these clusters are controlled by the weighted variance (wvar) measure. Each cluster is 
represented by the sequence mean of that cluster. Forward inference rules are then 
derived on each cluster to represent the input-output relationship. We have applied 
this technique to the wind tunnel experimental data and derived useful knowledge for 
MEMS-based aircraft control. 

From the experiment design aspect, we plan to expand the current wind tunnel 
experiments to include dynamic behaviors. Such experimental data will allow us to 
mine input-output relationships under dynamic environments. From the algorithm 
development aspect, the current sequence definition needs to be extended to include 
multiple input and output variables, which will widen the scope of frequent patterns. 
Further, we need to extend the proposed sequence clustering and rule derivation 
technique to future augmented datasets and reduce the computational complexity. 
Appendix 

Conventional data mining research has mostly focused on datasets with only categorical 
variables. To apply existing methods to this particular dataset, we first need to 
discretize the output variable, which takes continuous values. Such discretization 
methods are discussed in [6]. The basic idea is to leverage a concept hierarchy 
generated manually by experts or automatically from the data distribution. The 
Distance Sensitive Clustering (DISC) method [2] is used to build a Type Abstraction 
Hierarchy (TAH) for the output variable. Each node in such a hierarchy corresponds 
to a value range. The whole data range can be partitioned using the ranges of nodes at 
a certain level. For example, the rolling moment output data in Table 1 can be 
partitioned into six clusters: [-0.01179, -0.00380], [-0.00380, -0.00138], 
[-0.00138, -0.00024], [-0.00024, 0.00028], [0.00028, 0.00124], [0.00124, 0.00537]. 
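
Once the six ranges are known, mapping an output value to its class is a simple lookup, sketched below. The sketch does not implement DISC or the TAH construction; it only applies the boundaries listed above. Assigning a shared endpoint to the upper range is an assumption, since the ranges are written as closed intervals, and the function name is hypothetical.

```python
from bisect import bisect_right

# Range boundaries of the six clusters listed in the text.
BOUNDS = [-0.01179, -0.00380, -0.00138, -0.00024, 0.00028, 0.00124, 0.00537]


def discretize(value):
    """Index (0 to 5) of the range containing value.

    Shared endpoints are assigned to the upper range (an assumption).
    """
    if not BOUNDS[0] <= value <= BOUNDS[-1]:
        raise ValueError("value outside the partitioned range")
    return min(bisect_right(BOUNDS, value) - 1, len(BOUNDS) - 2)


print([discretize(v) for v in (-0.00485, 0.00092, 0.00026, 0.00537)])
```

For instance, the first three rolling moment values of Table 1 (-0.00485, 0.00092, 0.00026) fall into the first, fifth and fourth ranges, respectively.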

In the following sections, we apply two common methods to classify the rolling 
moment output: decision tree and association rule induction. 



Decision Tree 

The dataset was first run on a decision tree generation algorithm provided by IBM 
Intelligent Miner [11]. The class variable is the discretized rolling moment output. 
The input variables include angle of attack, stream velocity and actuation angle. The 
tool generated a four-level decision tree after pruning. To test the effectiveness of the 
result, the tree was directly applied back to predict the training dataset. Table 2 shows 
the prediction confusion matrix. The number in the (i, j)-th table entry represents the 
percentage of tuples that belong to the i-th class yet are predicted as the j-th class. 

The high error rate is attributable to the high dependency of the output variable on all 
the input variables. A single input variable has low predictive power on the output 
when taken alone. Therefore, univariate splitting, the basic philosophy behind 
decision tree induction, makes the method unsuitable for this kind of dataset. 



Table 2. Confusion matrix based on the pruned decision tree. Error rate: 46.35% 



                      Predicted classes
Actual classes        [-0.01179,  [-0.00380,  [-0.00138,  [-0.00024,  [0.00028,  [0.00124,  Total
                      -0.00380]   -0.00138]   -0.00024]   0.00028]    0.00124]   0.00537]
[-0.01179, -0.00380]  6.25%       0%          0%          1.56%       0%         1.04%      8.85%
[-0.00380, -0.00138]  1.56%       0%          0%          6.77%       0%         1.56%      9.90%
[-0.00138, -0.00024]  2.60%       0%          0%          10.42%      0%         2.60%      15.62%
[-0.00024, 0.00028]   3.02%       0%          0%          38.54%      0%         0.52%      42.18%
[0.00028, 0.00124]    4.17%       0%          0%          6.25%       0%         1.04%      11.46%
[0.00124, 0.00537]    1.04%       0%          0%          2.08%       0%         8.85%      11.98%
Total                 18.75%      0%          0%          65.62%      0%         15.62%     100%
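
The error rate quoted in the caption of Table 2 can be cross-checked from the matrix itself: it is the share of tuples that lie off the main diagonal. The sketch below copies the percentages from Table 2 (without the "Total" row and column); the function names are illustrative assumptions. Computed this way, the error rate comes out within rounding of the reported 46.35%, and the second helper lists the classes that the pruned tree never predicts.

```python
# Rows: actual class, columns: predicted class, in percent of all tuples.
# Values copied from Table 2 without the "Total" row and column.
CONFUSION = [
    [6.25, 0.0, 0.0, 1.56, 0.0, 1.04],
    [1.56, 0.0, 0.0, 6.77, 0.0, 1.56],
    [2.60, 0.0, 0.0, 10.42, 0.0, 2.60],
    [3.02, 0.0, 0.0, 38.54, 0.0, 0.52],
    [4.17, 0.0, 0.0, 6.25, 0.0, 1.04],
    [1.04, 0.0, 0.0, 2.08, 0.0, 8.85],
]


def error_rate(matrix):
    """Percentage of tuples whose predicted class differs from the actual one."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 - correct


def never_predicted(matrix):
    """Indices of classes (columns) that receive no prediction at all."""
    size = len(matrix)
    return [j for j in range(size) if all(row[j] == 0.0 for row in matrix)]
```

Three of the six columns are empty, so the tree effectively distinguishes only three output ranges, which is consistent with the explanation given above for the high error rate.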











Association Rules 

Our second effort was to apply association rule derivation methods to the dataset. 
Since we are concerned with using input variables to predict the output variable, we 
concentrate only on rules that have input variables on their left-hand sides and the 
output variable on their right-hand sides. The Apriori algorithm [1] was tested on 
the dataset after discretization. The minimum support and confidence were set to 3% 
and 70%, respectively. All the resulting rules that satisfy the above restriction are 
listed in Table 3. 
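
To make the two thresholds concrete, the following sketch computes support and confidence for a single rule of the restricted form (input variables in the body, the output class in the head). It is not the Apriori algorithm itself, only the measure that Apriori thresholds; the column names and the row representation (one dict per tuple) are assumptions made for illustration.

```python
def support_confidence(rows, body, head):
    """Return (support %, confidence %) of the rule body -> head.

    rows: list of dicts, one per tuple of the discretized dataset.
    body: dict mapping input variable names to required values.
    head: (output column name, output class) pair.
    """
    matches_body = [r for r in rows if all(r[k] == v for k, v in body.items())]
    matches_both = [r for r in matches_body if r[head[0]] == head[1]]
    support = 100.0 * len(matches_both) / len(rows)
    confidence = (100.0 * len(matches_both) / len(matches_body)
                  if matches_body else 0.0)
    return support, confidence


# Hypothetical counts, not given in the source: 192 tuples, 24 of them with
# angle of attack = 0, of which 23 fall into the class labelled "C4" here.
rows = ([{"aoa": 0, "rm": "C4"}] * 23 + [{"aoa": 0, "rm": "C1"}]
        + [{"aoa": 5, "rm": "C2"}] * 168)
support, confidence = support_confidence(rows, {"aoa": 0}, ("rm", "C4"))
```

A rule is kept only if both values reach the thresholds, here 3% and 70%.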



Table 3. Rules generated by Apriori that associate the output variable with the input variables 



#  Support (%)  Confidence (%)  Rule body                                        Rule head (i.e. the rolling moment)
1  11.9792      95.8300         angle of attack = 0                              [-0.00024, 0.00028]
2  4.1667       100.0000        angle of attack = 0 AND stream velocity = 10     [-0.00024, 0.00028]
3  3.6458       87.5000         angle of attack = 0 AND stream velocity = 15     [-0.00024, 0.00028]
4  4.1667       100.0000        angle of attack = 0 AND stream velocity = 20     [-0.00024, 0.00028]
5  9.8958       79.1700         angle of attack = 5                              [-0.00024, 0.00028]
6  3.1250       75.0000         angle of attack = 5 AND stream velocity = 10     [-0.00024, 0.00028]
7  3.1250       75.0000         angle of attack = 5 AND stream velocity = 15     [-0.00024, 0.00028]
8  3.6458       87.5000         angle of attack = 5 AND stream velocity = 20     [-0.00024, 0.00028]
9  3.1250       75.0000         stream velocity = 10.00 AND actuation angle = 0  [-0.00024, 0.00028]




The knowledge provided by these rules suffers from the following shortcomings: 

1. Low coverage. The nine rules in Table 3 cover only 28.125% of the original 
dataset, whereas the remaining 71.875% of the cases cannot be predicted. Because of the 
low coverage over the entire case space, this rule set cannot provide sufficient 
information about the input-output relationship. Thus, it is insufficient for flight 
control. 

2. Inability to handle control-sensitive regions. When the angle of attack is above 
15°, the output variable is more sensitive to the inputs. That is, the output in this 
region has larger magnitudes and greater variances. However, the rules derived by 
Apriori (Table 3) are mostly in the insensitive region (i.e. angle of attack below 
15°), since data in this region varies less and tends to give higher rule supports 
and confidences. 

Association rules fail to capture the sensitive region because of the basic 
rule form: “IF $X_1 = x_1$ AND ... AND $X_t = x_t$ THEN $Y = y$”. Here $X_1, \ldots, X_t$ are input 
variables and $Y$ is the output. For a dataset with $n$ input variables, $t$ is usually less 
than $n$; otherwise a rule simply reiterates a tuple in the dataset. However, a $t$ less 
than $n$ means omitting certain input variables. In the sensitive region, omitting any 
input variable in the left-hand side may be disastrous since the right-hand side cannot 
be concentrated in one category. This is best illustrated by the real data shown below. 



Thus, using conventional rule induction results in the following dilemma: rules 
generated either reiterate the original tuples, or have undesirably low confidences. 

Table 4. Dropping any one variable in “angle of attack = 20° AND stream velocity = 15 AND 
actuation angle = 60°” generates three 2-variable combinations. Each of these 2-variable 
combinations corresponds to a sub-table listed below. From these tables, we note that no rule 
containing only two variables in the left-hand side, e.g. “IF angle of attack = 20° AND stream 
velocity = 15 THEN rolling moment = ...”, will have a high confidence measure 



Angle of attack (°)  Stream velocity (m/s)  Actuation angle (°)  Rolling moment
20                   15                     40                   [-0.01179, -0.00380]
20                   15                     60                   [-0.00380, -0.00138]
20                   15                     80                   [-0.01179, -0.00380]
20                   15                     100                  [0.00124, 0.00537]
20                   15                     120                  [0.00028, 0.00124]
20                   15                     140                  [-0.00380, -0.00138]

a) angle of attack = 20 AND stream velocity = 15 

Angle of attack (°)  Stream velocity (m/s)  Actuation angle (°)  Rolling moment
0                    15                     60                   [0.00028, 0.00124]
5                    15                     60                   [-0.00024, 0.00028]
10                   15                     60                   [-0.00024, 0.00028]
15                   15                     60                   [0.00124, 0.00537]
20                   15                     60                   [-0.00380, -0.00138]
25                   15                     60                   [0.00028, 0.00124]
30                   15                     60                   [0.00124, 0.00537]
35                   15                     60                   [0.00124, 0.00537]

b) stream velocity = 15 AND actuation angle = 60 

Angle of attack (°)  Stream velocity (m/s)  Actuation angle (°)  Rolling moment
20                   10                     60                   [0.00028, 0.00124]
20                   15                     60                   [-0.00380, -0.00138]
20                   20                     60                   [-0.01179, -0.00380]

c) angle of attack = 20 AND actuation angle = 60 
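
The claim made with Table 4 can be checked directly: for each 2-variable rule body, the best achievable confidence is the share of the most frequent rolling moment class among the matching tuples. The sketch below copies the class labels from the three sub-tables; the dictionary layout and function name are illustrative assumptions, and the row for angle of attack = 20 in sub-table b) is read as [-0.00380, -0.00138] so that it agrees with the same tuple in sub-tables a) and c).

```python
from collections import Counter

# Rolling moment classes of the tuples matching each 2-variable rule body,
# copied from the three sub-tables of Table 4.
SUBTABLES = {
    "angle of attack = 20 AND stream velocity = 15": [
        "[-0.01179, -0.00380]", "[-0.00380, -0.00138]", "[-0.01179, -0.00380]",
        "[0.00124, 0.00537]", "[0.00028, 0.00124]", "[-0.00380, -0.00138]",
    ],
    "stream velocity = 15 AND actuation angle = 60": [
        "[0.00028, 0.00124]", "[-0.00024, 0.00028]", "[-0.00024, 0.00028]",
        "[0.00124, 0.00537]", "[-0.00380, -0.00138]", "[0.00028, 0.00124]",
        "[0.00124, 0.00537]", "[0.00124, 0.00537]",
    ],
    "angle of attack = 20 AND actuation angle = 60": [
        "[0.00028, 0.00124]", "[-0.00380, -0.00138]", "[-0.01179, -0.00380]",
    ],
}


def best_confidence(classes):
    """Most frequent head class and the confidence (%) a rule with it would reach."""
    head, count = Counter(classes).most_common(1)[0]
    return head, 100.0 * count / len(classes)


best = {body: best_confidence(classes) for body, classes in SUBTABLES.items()}
```

None of the three bodies reaches the 70% confidence threshold used for Table 3, so no 2-variable rule for this region would be reported.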

Abstract. Data clustering methods have many applications in the area of data 
mining. Traditional clustering algorithms deal with quantitative or categorical 
data points. However, there exist many important databases that store 
categorical data sequences, where significant knowledge is hidden behind 
sequential dependencies between the data. In this paper we introduce the problem 
of clustering categorical data sequences and present an efficient, scalable 
algorithm to solve the problem. Our algorithm implements the general idea of 
agglomerative hierarchical clustering and uses frequently occurring 
subsequences as features describing data sequences. The algorithm not only 
discovers a set of high quality clusters containing similar data sequences but 
also provides descriptions of the discovered clusters. 



1 Introduction 

Clustering is one of the most popular unsupervised data analysis methods; it aims at 
identifying groups of similar objects based on the values of their attributes [14][15]. 
Many clustering techniques have been proposed in the area of machine learning 
[7][12][14] and statistics [15]. Those techniques can be classified as partitional and 
hierarchical. Partitional clustering obtains a partition of data objects into a given 
number of clusters optimizing some clustering criterion. Hierarchical clustering produces a 
set of partitions forming a cluster hierarchy. Agglomerative hierarchical clustering 
starts with clusters containing single objects and then merges them until all objects are 
in the same cluster. In each iteration the two most similar clusters are merged. Divisive 
hierarchical clustering starts with one cluster and iteratively divides it into smaller 
pieces. 

Emerging data mining applications place additional requirements on clustering 
techniques, namely: scalability with database sizes, effective treatment of high 
dimensionality and interpretability of results [1]. Recently, the problem of data 
clustering has been redefined in the data mining area. The concept of cluster mining 
[18] is used to represent a method which analyzes very large data sets to efficiently 
identify a small set of high-quality (statistically strong) clusters of data items. Cluster 
mining does not aim at partitioning of all the data items - instead, less frequent noise 
and outliers are ignored. In other words, cluster mining finds only the highest density 
areas hidden in the data space. 

A number of efficient and scalable clustering algorithms for clustering data points 
represented by multidimensional quantitative values [1] [6] [10] [22], as well as by sets 
of categorical values (product names, URLs, etc.) [8][9][11][13][19][21] have been 
proposed so far in the data mining area. However, we notice that there exist many 
important databases that store categorical data sequences, where significant 
knowledge is hidden behind sequential dependencies between the data (credit card 
usage history, operating system logs, database redo logs, web access paths, etc.). The 
existing clustering algorithms either cannot be easily transformed to deal with 
categorical data sequences or cannot take into account the sequential dependencies. 

Categorical data sequence clustering can have many applications in the area of 
behavioral segmentation, e.g. web user segmentation [18]. The problem of web 
user segmentation is to use web access log files to partition a set of users into 
clusters such that the users within a cluster are more similar to each other than users 
from different clusters. The discovered clusters can then help in on-the-fly 
transformation of the web site content. In particular, web pages can be automatically 
linked by additional hyperlinks. The idea is to try to match an active user's access 
pattern with one or more of the clusters discovered from the web log files. Pages in 
the matched clusters that have not been explored by the user may serve as 
navigational hints for the user to follow. 

In this paper we define the problem of clustering categorical data sequences and we 
propose an efficient algorithm to solve the problem. The algorithm employs the idea 
of agglomerative hierarchical clustering, which consists in merging pairs of similar 
clusters to form new larger clusters. We make the following assumptions: (1) the 
simplest cluster (elementary cluster) is a set of data sequences containing a common 
subsequence, (2) a significant cluster is a cluster that contains a large number of data 
sequences, (3) two clusters can be merged if a large number of their corresponding 
sequences overlap. Our algorithm starts with a set of significant elementary clusters 
and merges them iteratively until a user-defined stop condition is satisfied. 

Let us illustrate the approach with the following example (Fig. 1). We are given a 
web log file, which records paths used by users for navigation (e.g. the user s1 has 
visited the URLs: A, B, C, Z, and then D). Assume we are interested in discovering 
groups of users (sequences), whose behavior is similar to each other, i.e. who visit 
identical pages in the identical order. First, we create the elementary clusters c1, c2, 
c3, c4 that contain overlapping sequences. Then we notice that we can merge the 
clusters c1, c2, and c3 since the sequences they contain overlap between the clusters. 
Finally, the algorithm ends with two clusters, which represent web users of similar 
behavior. 

The paper is organized as follows. Section 2 discusses related work. In Section 3, the 
basic definitions and the formulation of the problem are given. Section 4 contains the 
problem decomposition and the description of the algorithm for pattern-oriented 
clustering. Experimental results concerning the proposed clustering method are 
presented in Section 5. We conclude with a summary in Section 6. 

Input sequences:
  s1: A->B->C->Z->D      s4: L->M->N->O
  s2: P->Q->R            s5: E->F->G->A->B
  s3: C->D->E->F         s6: Q->X->R->S

Elementary clusters:
  c1: s1: A->B->C->Z->D,  s3: C->D->E->F
  c2: s3: C->D->E->F,     s5: E->F->G->A->B
  c3: s1: A->B->C->Z->D,  s5: E->F->G->A->B
  c4: s2: P->Q->R,        s6: Q->X->R->S

Clusters after merging:
  c1: s1: A->B->C->Z->D,  s3: C->D->E->F,  s5: E->F->G->A->B
  c4: s2: P->Q->R,        s6: Q->X->R->S

Fig. 1. Behavioral segmentation example 



2 Related Work 

Many clustering algorithms have been proposed in the area of machine learning 
[7] [12] [14] and statistics [15]. Those traditional algorithms group the data based on 
some measure of similarity or distance between data points. They are suitable for 
clustering data sets that can be easily transformed into sets of points in n-dimensional 
space, which makes them inappropriate for categorical data. 

Recently, several clustering algorithms for categorical data have been proposed. In 
[13] a method for hypergraph-based clustering of transaction data in a high 
dimensional space has been presented. The method used frequent itemsets to cluster 
items. Discovered clusters of items were then used to cluster customer transactions. 
[9] described a novel approach to clustering collections of sets, and its application to 
the analysis and mining of categorical data. The proposed algorithm facilitated a type 
of similarity measure arising from the co-occurrence of values in the data set. In [8] 
an algorithm named CACTUS was presented together with the definition of a cluster 
for categorical data. In contrast with the previous approaches to clustering categorical 
data, CACTUS gives formal descriptions of discovered clusters. 

In [21] the authors replace pairwise similarity measures, which they believe are 
inappropriate for categorical data, with a clustering criterion based on the notion of 
large items. An efficient clustering algorithm based on the new clustering criterion is 
also proposed. 

The problem of clustering sequences of complex objects was addressed in [16]. 
The clustering method presented there used class hierarchies discovered for objects 
forming sequences in the process of clustering sequences seen as complex objects. 
The approach assumed applying some traditional clustering algorithm to discover 
classes of sub-objects, which makes it suitable for sequences of objects described by 
numerical values, e.g. trajectories of moving objects. 

The most similar approach to ours is probably the approach to document clustering 
proposed in [5]. The most significant difference between their similarity measure and 
ours is that we look for the occurrence of variable-length subsequences and 
concentrate only on frequent ones. 

Our clustering method can be seen as a scalable version of a traditional 
agglomerative clustering algorithm. Scaling other traditional clustering methods to 
large databases was addressed in [4], where a scalable version of the K-means 
algorithm was proposed. 

Most of the research on sequences of categorical values has concentrated on the 
discovery of frequently occurring patterns. The problem was introduced in [2] and 
then generalized in [20]. The class of patterns considered there, called sequential 
patterns, had the form of sequences of sets of items. The statistical significance of a 
pattern (called support) was measured as the percentage of data sequences containing 
the pattern. 

In [17] an interesting approach to sequence classification was presented. In the 
approach, sequential patterns were used as features describing objects and standard 
classification algorithms were applied. To reduce the number of features used in the 
classification process, only distinctive (correlated with one class) patterns were taken 
into account. 



3 Pattern-Oriented Agglomerative Hierarchical Clustering 

Traditional agglomerative hierarchical clustering algorithms start by placing each 
object in its own cluster and then iteratively merge these atomic clusters until all 
objects are in a single cluster. The time complexity of typical implementations varies with 
the cube of the number of objects being clustered. Due to this poor scalability, 
traditional hierarchical clustering cannot be applied to large collections of data, such 
as databases of customer purchase histories or web server access logs. Another very 
important issue that has to be addressed when clustering large data sets is automatic 
generation of conceptual descriptions of discovered clusters. Such descriptions should 
summarize clusters’ contents and have to be comprehensible to humans. 

To improve performance of hierarchical clustering for large sets of sequential data, 
we do not handle sequences individually but operate on groups of sequences sharing a 
common subsequence. We concentrate only on the most frequently occurring 
subsequences, called frequent patterns (sequential patterns). We start with initial 
clusters associated with frequent patterns discovered in the database. Each of the 
initial clusters (clusters forming the leaf nodes of the hierarchy built by the clustering 
process) consists of sequences containing the pattern associated with the cluster. 
Clusters that result from merging smaller clusters are described by sets of patterns 
and consist of sequences that contain at least one pattern from the describing set. 

Definition 3.1. Let $L = \{l_1, l_2, \ldots, l_m\}$ be a set of literals called items. A sequence 
$S = \langle X_1 X_2 \ldots X_n \rangle$ is an ordered list of sets of items such that each set of items 
$X_i \subseteq L$. Let the database $D$ be a set of sequences. 

Definition 3.2. We say that the sequence $S_1 = \langle Y_1 Y_2 \ldots Y_m \rangle$ supports the sequence 
$S_2 = \langle X_1 X_2 \ldots X_n \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that 
$X_1 \subseteq Y_{i_1}, X_2 \subseteq Y_{i_2}, \ldots, X_n \subseteq Y_{i_n}$. We also say that the sequence $S_2$ is a 
subsequence of the sequence $S_1$ (denoted by $S_2 \sqsubseteq S_1$). 
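
The support relation of Definition 3.2 can be tested with a single left-to-right pass. The following Python function is a minimal sketch, assuming that a sequence is represented as a list of sets of items; the function name is an illustrative choice. Matching every pattern element at the earliest possible position is sufficient, because an earlier match never rules out a later one.

```python
def supports(sequence, pattern):
    """Return True if sequence supports pattern (Definition 3.2).

    Both arguments are lists of sets of items. Each element of pattern
    must be a subset of some element of sequence, and the matched
    positions must be strictly increasing.
    """
    pos = 0
    for wanted in pattern:
        # Skip elements of the sequence that do not contain this itemset.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True
```

For instance, the sequence A, B, C, Z, D (each element a one-item set) supports the pattern C, D but not the pattern D, C.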

Definition 3.3. A frequent pattern is a sequence that is supported by more than a 
user-defined minimum number of sequences in $D$. Let $P$ be the set of all frequent 
patterns in $D$. 

Definition 3.4. A cluster is an ordered pair $\langle Q, S \rangle$, where $Q \subseteq P$ and $S \subseteq D$, and $S$ is the 
set of all database sequences supporting at least one pattern from $Q$. We call $Q$ a 
cluster description, and $S$ a cluster content. We use a dot notation to refer to a 
description or a content of a given cluster ($c.Q$ represents the description of a cluster $c$ 
while $c.S$ represents its content). 

Definition 3.5. A cluster $c$ is called an elementary cluster if and only if $|c.Q| = 1$. 

Definition 3.6. A union $c_{ab}$ of the two clusters $c_a$ and $c_b$ is defined as follows: 

$c_{ab} = union(c_a, c_b) = \langle c_a.Q \cup c_b.Q, \; c_a.S \cup c_b.S \rangle$. 
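
Definitions 3.4 to 3.6 translate almost literally into a small data structure. The sketch below is one possible representation, assuming that patterns and sequences are referred to by identifiers; the class name, the helper `elementary`, and the identifiers used in the example are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    Q: frozenset  # description: identifiers of the describing patterns
    S: frozenset  # content: identifiers of sequences supporting a pattern in Q


def elementary(pattern_id, supporting_ids):
    """Elementary cluster (Definition 3.5): its description holds one pattern."""
    return Cluster(frozenset([pattern_id]), frozenset(supporting_ids))


def union(ca, cb):
    """Union of two clusters (Definition 3.6)."""
    return Cluster(ca.Q | cb.Q, ca.S | cb.S)


# Hypothetical identifiers: two elementary clusters that share one sequence.
c1 = elementary("p1", {"s1", "s3"})
c2 = elementary("p2", {"s3", "s5"})
c12 = union(c1, c2)
```

Because the content of a union is the set union of the contents, a sequence shared by both clusters appears only once in the merged cluster.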



Elementary clusters form leaves in the cluster hierarchy. The clustering algorithm 
starts with the set of all elementary clusters. Due to the above formulations, input 
sequences that do not support any frequent pattern will not be assigned to any 
elementary cluster and will not be included in the resulting clustering. Such sequences 
are treated as outliers. The above definitions also imply that in our approach clusters 
from different branches of the cluster hierarchy may overlap. We do not consider it to 
be a serious disadvantage since sometimes it is very difficult to assign a given object 
to exactly one cluster, especially when objects are described by categorical values. In 
fact, if two clusters overlap significantly, then it means that patterns describing one of 
the clusters occur frequently in sequences contained in the other cluster and vice 
versa. This means that such clusters are good candidates to be merged to form a new 
larger cluster. The cluster similarity measures we propose in this paper are based 
on this observation: they rely on the co-occurrence of the frequent patterns. 

Definition 3.7. The co-occurrence of two frequent patterns $p_1$ and $p_2$ is a Jaccard 
coefficient [14] applied to the sets of input sequences supporting the patterns: 

$$co(p_1, p_2) = \frac{|\{ s_i \in D : s_i \text{ supports } p_1 \wedge s_i \text{ supports } p_2 \}|}{|\{ s_i \in D : s_i \text{ supports } p_1 \vee s_i \text{ supports } p_2 \}|} \qquad (1)$$
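
Equation (1) is the ratio of two set sizes, so it can be sketched in a few lines when the sets of supporting sequences are available. The function name and the handling of the degenerate case with two empty sets are assumptions; for frequent patterns the denominator is never zero.

```python
def co_occurrence(support_1, support_2):
    """Jaccard coefficient of two sets of supporting sequence identifiers."""
    either = support_1 | support_2
    if not either:
        # Assumption: the undefined 0/0 case is treated as no co-occurrence.
        return 0.0
    return len(support_1 & support_2) / len(either)
```

Two patterns supported by exactly the same sequences have co-occurrence 1, and two patterns that never appear in the same sequence have co-occurrence 0.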



The similarity of two elementary clusters is simply the co-occurrence of patterns from 
their descriptions. The first of our inter-cluster similarity measures for arbitrary 
clusters is the extension of the above pattern co-occurrence measure, i.e. the similarity 
between two clusters $c_a$ and $c_b$ is a Jaccard coefficient applied to cluster contents: 

$$f_J(c_a, c_b) = \frac{|c_a.S \cap c_b.S|}{|c_a.S \cup c_b.S|} \qquad (2)$$

The second inter-cluster similarity measure we consider in the paper is defined as the 
average co-occurrence of pairs of patterns between two clusters’ descriptions 
(group-average similarity): 

$$f_{GA}(c_a, c_b) = avg\big(co(p_i, p_j)\big), \text{ where } p_i \in c_a.Q \wedge p_j \in c_b.Q \qquad (3)$$
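
The two measures need different inputs: the Jaccard measure works on cluster contents, while the group-average measure needs only the cluster descriptions and the pattern co-occurrence values. The sketch below shows both side by side. The function names and the representation of co-occurrences as a dictionary keyed by unordered pattern pairs are assumptions made for illustration; the toy values are hypothetical.

```python
from itertools import product


def jaccard_similarity(content_a, content_b):
    """Jaccard coefficient of two cluster contents."""
    either = content_a | content_b
    return len(content_a & content_b) / len(either) if either else 0.0


def group_average_similarity(desc_a, desc_b, co):
    """Average co-occurrence over all pattern pairs between two descriptions.

    co maps frozenset({p, q}) to the co-occurrence of patterns p and q.
    The descriptions are assumed to be disjoint and non-empty.
    """
    pairs = list(product(desc_a, desc_b))
    return sum(co[frozenset(pair)] for pair in pairs) / len(pairs)


# Hypothetical toy data: cluster a merges patterns p1 and p2, cluster b is
# the elementary cluster of p3.
co = {frozenset(("p1", "p3")): 1 / 3, frozenset(("p2", "p3")): 1 / 3}
sim_j = jaccard_similarity({"s1", "s3", "s5"}, {"s1", "s5"})
sim_ga = group_average_similarity({"p1", "p2"}, {"p3"}, co)
```

On this toy input the two measures disagree (2/3 against 1/3), which illustrates that they may rank candidate merges differently.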

Traditional hierarchical clustering builds a complete cluster hierarchy in the form of a 
tree. We add two stop conditions that can be specified by a user: the required number 
of clusters and the inter-cluster similarity threshold for two clusters to be merged. The 
first stop condition is suitable when a user wants to obtain the partitioning of the data 
set into the desired number of parts, the second is provided for cluster mining, which 
identifies high quality clusters. 

Problem Statement. Given a database $D = \{s_1, s_2, \ldots, s_n\}$ of data sequences, and a set 
$P = \{p_1, p_2, \ldots, p_m\}$ of frequent patterns in $D$, the problem is to build a cluster hierarchy 
starting from elementary clusters as leaves, iteratively merging the most similar 
clusters until the required number of clusters is reached or there are no pairs of 
clusters exceeding the specified similarity threshold. 



4 Algorithms 

In this section, we describe a new clustering algorithm POPC for clustering large 
volumes of sequential data (POPC stands for Pattern-Oriented Partial Clustering). The 
algorithm implements the general idea of agglomerative hierarchical clustering. As 
we mentioned before, instead of starting with a set of clusters containing one data 
sequence each, our algorithm uses previously discovered frequent patterns and starts 
with clusters containing data sequences supporting the same frequent pattern. We 
assume that a set of frequent patterns has already been discovered and we do not 
include the pattern discovery phase in our algorithm. The influence of the pattern 
discovery process on the overall performance of our clustering method is described in 
the next section. 

The POPC algorithm is database-oriented. It assumes that the input data sequences 
and the contents of clusters to be discovered are stored on a hard disk, possibly 
managed by a standard DBMS. Only the structures whose size depends on the number 
of patterns used for clustering and not on the number of input data sequences are 
stored in the main memory. These structures are similarity and co-occurrence 
matrices and cluster descriptions. 

We introduce two variants of our algorithm based on two different cluster 
similarity measures: POPC-J using the Jaccard coefficient of the clusters’ contents, 
and POPC-GA using the group average of co-occurrences of patterns describing 
clusters. First we present the generic POPC algorithm and then we describe elements 
specific to particular variants of the algorithm. 

4.1 Generic POPC Algorithm 

The algorithm for partial clustering based on frequently occurring patterns is 
decomposed into the following two phases: 

• Initialization Phase, which creates the initial set of elementary clusters and 
computes the co-occurrence matrix between patterns which serves as the 
similarity matrix for the initial set of clusters, 

• Merge Phase, which iteratively merges the most similar clusters. 

4.1.1 Initialization Phase 

In this phase, the initial set of clusters $C_1$ is created by mapping each frequent pattern 
into a cluster. During the sequential scan of the source database, the contents of the initial 
clusters are built. At the same time the co-occurrence matrix for the frequent patterns 
is computed. This is the only scan of the source database required by our algorithm. 
(It is not the only place where access to disk is necessary because we assume that 
cluster contents are stored on disk too.) Figure 2 presents the Initialization Phase of 
the clustering algorithm. 

    C1 = {ci : ci.Q = {pi}, ci.S = ∅};
    UNION[][] = {0}; INTERSECT[][] = {0};
    for each sj ∈ D do
    begin
        for each pi ∈ P do
            if sj supports pi then
                ci.S = ci.S ∪ {sj};
            end if;
        for each pi, pk ∈ P do
            if sj supports pi or sj supports pk then
                UNION[i][k]++; UNION[k][i]++;
                if sj supports pi and sj supports pk then
                    INTERSECT[i][k]++; INTERSECT[k][i]++;
                end if;
            end if;
    end;
    for each pi, pk ∈ P do
        CO[i][k] = INTERSECT[i][k] / UNION[i][k];
    M1 = CO;

Fig. 2. Initialization phase 

To compute the pattern co-occurrence matrix CO, for each pair of patterns we 
maintain two counters to count the number of sequences supporting at least one of the 
patterns and both of the patterns, respectively. Those counters are represented by 
temporary matrices UNION and INTERSECT, and are used to evaluate the 
coefficients in the matrix CO after the database scan is completed. The similarity 
matrix $M_1$ for the initial set of clusters $C_1$ is equal to the pattern co-occurrence matrix. 
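
The Initialization Phase can be sketched in Python as a single pass over the database. This is an in-memory illustration under stated assumptions, not the database-oriented implementation described in the paper: cluster contents are kept as Python sets instead of being stored on disk, the support test is passed in as a function, and the diagonal of the co-occurrence matrix is left at zero because it is never needed for merging.

```python
from itertools import combinations


def initialize(database, patterns, supports):
    """One scan over database; returns (contents, co).

    contents[i] is the set of sequence indices supporting patterns[i].
    co[i][k] is the co-occurrence of patterns i and k (symmetric).
    supports(sequence, pattern) is supplied by the caller.
    """
    m = len(patterns)
    contents = [set() for _ in range(m)]
    union = [[0] * m for _ in range(m)]
    intersect = [[0] * m for _ in range(m)]
    for j, sequence in enumerate(database):
        hit = [supports(sequence, p) for p in patterns]
        for i in range(m):
            if hit[i]:
                contents[i].add(j)
        # Each unordered pair is visited once and both cells are updated.
        for i, k in combinations(range(m), 2):
            if hit[i] or hit[k]:
                union[i][k] += 1
                union[k][i] += 1
                if hit[i] and hit[k]:
                    intersect[i][k] += 1
                    intersect[k][i] += 1
    co = [[0.0] * m for _ in range(m)]
    for i, k in combinations(range(m), 2):
        if union[i][k]:
            co[i][k] = co[k][i] = intersect[i][k] / union[i][k]
    return contents, co
```

Evaluating the support test once per sequence and pattern, and reusing the results for all pattern pairs, keeps the per-sequence work proportional to the square of the number of patterns and independent of the database size.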

4.1.2 Merge Phase 

Figure 3 presents the Merge Phase of the clustering algorithm. This phase of the 
algorithm iteratively merges together pairs of clusters according to their similarity 
values. In each iteration $k$, the two most similar clusters $c_a, c_b \in C_k$ are determined and 
replaced by a new cluster $c_{ab} = union(c_a, c_b)$. The actual merging is done by the 
function called cluster, described in detail in Section 4.1.4. When the new cluster is 
created, the matrix containing similarity values has to be re-evaluated. This operation 
is performed by means of the function called simeval, described in Section 4.1.3. 

    k = 1;
    while |Ck| > n and exist ca, cb ∈ Ck
          such that f(ca, cb) > min_sim do begin
        Ck+1 = cluster(Ck, Mk);
        Mk+1 = simeval(Ck+1, Mk);
        k++;
    end;
    Answer = Ck;

Fig. 3. Merge phase 

The Merge Phase stops when the number of clusters reaches $n$ (the required number 
of clusters) or when there is no pair of clusters $c_a, c_b \in C_k$ whose similarity is 
greater than min_sim (the similarity threshold). 
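
The control flow of the Merge Phase, including both stop conditions, can be sketched as follows. For brevity this illustration recomputes all pairwise similarities in every iteration instead of maintaining the similarity matrix incrementally, so it shows the logic of the loop rather than its cost. The representation of a cluster as a (description, content) pair of frozensets and the function name are assumptions.

```python
def merge_phase(clusters, similarity, n, min_sim):
    """Merge the most similar pair until a stop condition holds.

    clusters: list of (description, content) pairs of frozensets.
    similarity: function of two clusters (the function f of the text).
    Stops when at most n clusters remain or no pair exceeds min_sim.
    """
    clusters = list(clusters)
    while len(clusters) > n:
        best, pair = min_sim, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = similarity(clusters[a], clusters[b])
                if s > best:  # strict comparison keeps the first maximal pair
                    best, pair = s, (a, b)
        if pair is None:
            break  # no pair is more similar than min_sim
        a, b = pair
        merged = (clusters[a][0] | clusters[b][0],
                  clusters[a][1] | clusters[b][1])
        clusters = [c for i, c in enumerate(clusters) if i not in pair]
        clusters.append(merged)
    return clusters
```

Passing the Jaccard coefficient of the contents as `similarity` corresponds to the first measure, and passing a group-average function corresponds to the second.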

4.1.3 Similarity Matrix Evaluation: Simeval 

Similarity matrix $M_l$ stores the values of the inter-cluster similarity function for all 
possible pairs of clusters in the $l$-th algorithm iteration. The cell $M_l[x][y]$ represents 
the similarity value for the clusters $c_x$ and $c_y$ from the cluster set $C_l$. The function 
simeval computes the values of the similarity matrix $M_{l+1}$ using both the similarity 
matrix $M_l$ and the current set of clusters. Notice that in each iteration, the similarity 
matrix need not be completely re-computed. Only the similarity values concerning the 
newly created cluster have to be evaluated. Due to the diagonal symmetry of the 
similarity matrix, for $k$ clusters, only $(k-1)$ similarity function values need to be 
computed in each iteration. 

In each iteration, the size of the matrix decreases since two rows and two columns 
corresponding to the clusters merged to form a new one are removed and only one 
column and one row are added for a newly created cluster. 
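
The incremental update can be sketched with a dictionary keyed by unordered pairs of cluster identifiers, which makes the symmetry of the matrix implicit. The signature below is an assumption made for illustration and differs from the two-argument simeval of Fig. 3: the identifiers of the merged clusters and of the new cluster are passed explicitly.

```python
def simeval(sim, merged_a, merged_b, new_id, cluster_ids, similarity):
    """Return the next similarity table after a merge.

    sim maps frozenset({x, y}) to the similarity of clusters x and y.
    merged_a and merged_b were replaced by new_id; cluster_ids lists the
    identifiers of the current clusters, including new_id.
    Only the entries that involve new_id are computed.
    """
    gone = {merged_a, merged_b}
    # Drop the two rows and columns of the merged clusters.
    nxt = {pair: value for pair, value in sim.items() if not (pair & gone)}
    # Add one row and column for the new cluster.
    for other in cluster_ids:
        if other != new_id:
            nxt[frozenset((new_id, other))] = similarity(new_id, other)
    return nxt
```

With k clusters after the merge, the similarity function is called k-1 times, and all other entries are carried over unchanged.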

4.1.4 Cluster Merging: Cluster 

In each iteration, the number of processed clusters decreases by one. The similarity-based 
merging is done by the function called cluster. The function cluster scans the 
similarity matrix and finds the pairs of clusters whose similarity is maximal. If 
there are many pairs of clusters that reach the maximal similarity value, then the 
function cluster selects the one that was found first. The function cluster takes a set 
of clusters $C_k$ as one of its parameters and returns a set of clusters $C_{k+1}$ such that 
$C_{k+1} = (C_k \setminus \{c_a, c_b\}) \cup \{c_{ab}\}$, where $c_a, c_b \in C_k$ are the clusters chosen for merging and 
$c_{ab} = union(c_a, c_b)$. 

4.2 Algorithm Variants POPC-J and POPC-GA 

The POPC-J version of the algorithm optimizes the usage of main memory. The 
pattern co-occurrence matrix is used only in the Initialization Phase as the initial 
cluster similarity matrix. It is not needed in the Merge Phase of the algorithm because 
the similarity function based on the Jaccard coefficient does not refer to the 
co-occurrences of patterns describing clusters. Thus the co-occurrence matrix CO 
becomes the initial cluster similarity matrix and no copying is done, which reduces 
the amount of main memory used by the algorithm. 

In case of the POPC-GA version of the algorithm the initial cluster similarity 
matrix is created as a copy of the pattern co-occurrence matrix CO, since the latter 
is used in each iteration of the Merge Phase to compute the similarity between the 
newly created cluster and the rest of clusters. The advantage of this version of the 
POPC algorithm is that it does not use clusters’ contents to evaluate similarity 
between clusters in the Merge Phase (only clusters’ descriptions and the pattern 
co-occurrence matrix are used). Due to this observation, the POPC-GA does not build 
clusters’ contents while the Merge Phase progresses. This is a serious optimization as 
compared to the POPC-J algorithm which has to retrieve clusters’ contents from disk 
to re-evaluate the similarity matrix. Contents of discovered clusters can be built in one 
step after the Merge Phase completes according to the following SQL query (using 
the clusters’ descriptions maintained throughout the algorithm and sets of input 
sequences supporting given patterns, built in the Initialization Phase): 

    select distinct d.cluster_id, p.sequence_id
    from CLUSTER_DESCRIPTIONS d, PATTERNS p
    where d.pattern_id = p.pattern_id

Each row in the CLUSTER_DESCRIPTIONS table contains information about the 
mapping of one pattern to the description of one cluster, while each row in the 
PATTERNS table contains information that a given data sequence supports a given 
pattern. 
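
The query can be tried out on a toy instance with Python's built-in sqlite3 module. The table and column names are taken from the query above; the column types, the sample rows and the added ORDER BY clause (used only to make the output order predictable) are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CLUSTER_DESCRIPTIONS (cluster_id INTEGER, pattern_id INTEGER);
    CREATE TABLE PATTERNS (pattern_id INTEGER, sequence_id INTEGER);
""")
# Hypothetical data: cluster 1 is described by patterns 1-3, cluster 2 by
# pattern 4; each PATTERNS row says that a sequence supports a pattern.
conn.executemany("INSERT INTO CLUSTER_DESCRIPTIONS VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 4)])
conn.executemany("INSERT INTO PATTERNS VALUES (?, ?)",
                 [(1, 1), (1, 3), (2, 3), (2, 5), (3, 1), (3, 5),
                  (4, 2), (4, 6)])
rows = conn.execute("""
    SELECT DISTINCT d.cluster_id, p.sequence_id
    FROM CLUSTER_DESCRIPTIONS d, PATTERNS p
    WHERE d.pattern_id = p.pattern_id
    ORDER BY d.cluster_id, p.sequence_id
""").fetchall()
conn.close()
```

The DISTINCT keyword matters here: a sequence that supports several patterns of the same cluster would otherwise be listed once per pattern.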



5 Experimental Results 

To assess the performance and results of the clustering algorithm, we performed 
several experiments on a PC with an Intel Celeron 266 MHz processor and 96 
MB of RAM. The data were stored in an Oracle8i database on the same machine. 
Experimental data sets were created by the synthetic data generator GEN from the Quest 
project [3]. 

First of all, we compared the sets of clusters generated by the two versions of the 
POPC algorithm. The difference was measured as a percentage of all pairs of 
sequences from all clusters discovered by one version of the algorithm that were not 
put into one cluster by the other version of the algorithm. This measure is asymmetric 
but we believe that it captures the difference between two results of clustering. We 
performed several tests on a small data set consisting of 200 input sequences, using 
about 100 frequent patterns. As a stop condition for both versions of the algorithm we 
chose the desired number of clusters ranging from 5 to 45. The average difference 
between clustering results generated by the two methods was less than 10%. 
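
One way to read the difference measure described above is sketched below: collect every pair of sequences that share a cluster in one result and report the percentage of those pairs that do not share a cluster in the other result. Counting a pair only once even if it occurs in several overlapping clusters is an assumption, as are the function names.

```python
from itertools import combinations


def co_clustered_pairs(clustering):
    """Set of unordered sequence pairs that share at least one cluster."""
    pairs = set()
    for content in clustering:
        pairs.update(frozenset(p) for p in combinations(sorted(content), 2))
    return pairs


def difference(result_a, result_b):
    """Percentage of pairs co-clustered in result_a but not in result_b."""
    pairs_a = co_clustered_pairs(result_a)
    if not pairs_a:
        return 0.0
    pairs_b = co_clustered_pairs(result_b)
    return 100.0 * len(pairs_a - pairs_b) / len(pairs_a)
```

The measure is asymmetric, as noted in the text: `difference([{1, 2, 3}], [{1, 2}])` is about 66.7, while swapping the arguments gives 0.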

In the next experiment we compared the performance of the two POPC variants 
and tested their scalability with respect to the size of the database (expressed as a 
number of input sequences) and the number of frequent patterns used for clustering. 





Fig. 4. Execution time for different database sizes 

Fig. 5. Execution time for different numbers of frequent patterns used for clustering 

Figure 4 shows the performance of the clustering algorithm for different database 
sizes expressed as the number of sequences in the database. In the experiment, for all 
the database sizes the data distribution was the same, which resulted in the same set of 
patterns used for clustering. Both versions of the algorithm scale linearly with the 
number of source sequences, which makes them suitable for large databases. The key 
factor is that the number of frequent patterns (equal to the number of initial clusters in 
our approach) does not depend on the database size but on the data distribution only. 
The execution time depends linearly on the number of input sequences, because the 
number of sequences supporting a given frequent pattern (for the same support 
threshold) grows linearly as the number of sequences in the database increases. 

Figure 5 illustrates the influence of the number of frequent patterns used for 
clustering on the execution time of our algorithm. The time requirements of the 
algorithm vary with the cube of the number of patterns (the maximal possible number 
of iterations in the Merge Phase is equal to the number of patterns decreased by 1, in 
each iteration the cluster similarity matrix has to be scanned, the initial size of the 
matrix is equal to the square of the number of patterns). We performed experiments 
on a small database consisting of 200 sequences, using from 95 to 380 patterns. The 
experiments show that in practice the algorithm scales well with the number of 
patterns. This is true especially for the POPC-GA version of the algorithm, for which 
the cost of the Initialization Phase dominates the efficient Merge Phase. 

Experiments show that both methods are scalable, but POPC-GA significantly 
outperforms POPC-J thanks to the fact that it does not have to retrieve clusters’ 
contents from the database in the Merge Phase of the algorithm. 

The execution times presented in the charts do not include the time needed to 
discover the set of frequent patterns. The cost of this pre-processing step depends 
strongly on the data distribution and the support threshold for patterns to be called 
frequent. Nevertheless, the time required to discover frequent patterns depends 
linearly on the database size, which preserves the overall linear scalability of the 
clustering method with respect to the database size. In our experiments we used the 
GSP algorithm [20] for pattern discovery. The time required for this step varied from 
5 to 17 seconds for database sizes from 1000 to 5000 input sequences and the support
threshold of 6%. This means that the time needed for pattern discovery does not 
contribute significantly to the overall processing time. 



6 Concluding Remarks 

We considered the problem of hierarchical clustering of large volumes of sequences 
of categorical values. We introduced two variants of the algorithm using different 
similarity functions to evaluate the inter-cluster similarity, which is a crucial element 
of the agglomerative hierarchical clustering scheme. Both of the proposed similarity 
measures were based on the co-occurrence of frequent patterns. 

Both versions of the algorithm scale linearly with respect to the size of the source 
database, which is very important for large data sets. Both methods generate similar 
sets of clusters but the POPC-GA variant is much more efficient than POPC-J. 

An important feature of the algorithm is that it not only discovers the clusters
but also delivers a description of each cluster in the form of patterns that are “popular”
within the set of sequences forming the cluster.

In our approach clusters at any level of a cluster hierarchy can overlap. However,
our method can easily be modified to generate disjoint clusters, for example by
placing each sequence into the cluster from whose description the
sequence supports the highest number or percentage of patterns.
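
The modification suggested above can be illustrated with a short sketch. It makes several assumptions that are not taken from the paper: sequences and patterns are plain lists of categorical values, a sequence supports a pattern when the pattern occurs in it as a (not necessarily contiguous) subsequence, ties go to the first cluster examined, and the names supports and assign_disjoint are hypothetical.

```python
def supports(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an order-preserving subsequence."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator until the item is found.
    return all(item in remaining for item in pattern)


def assign_disjoint(sequences, descriptions, use_percentage=False):
    """Assign every sequence to the single cluster whose description it fits best.

    sequences:    dict mapping sequence id -> list of values
    descriptions: dict mapping cluster id -> list of patterns describing the cluster
    Returns a dict mapping sequence id -> cluster id (None if nothing is supported).
    """
    assignment = {}
    for seq_id, seq in sequences.items():
        best_cluster, best_score = None, 0.0
        for cluster_id, patterns in descriptions.items():
            hits = sum(supports(seq, p) for p in patterns)
            if use_percentage and patterns:
                score = hits / len(patterns)
            else:
                score = float(hits)
            if score > best_score:  # strict comparison: first cluster wins ties
                best_cluster, best_score = cluster_id, score
        assignment[seq_id] = best_cluster
    return assignment
```

Scoring by count favours clusters with long descriptions, while scoring by percentage favours clusters whose description is matched almost completely; the two criteria can therefore give different partitions.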

Abstract. This paper studies the problem of mining frequent sequences
in transactional databases. In [1], Agrawal and Srikant proposed the
AprioriAll algorithm for extracting frequently occurring sequences.
AprioriAll is an iterative algorithm. It scans the database a number
of times depending on the length of the longest frequent sequences
in the database. The I/O cost is thus substantial if the database
contains very long frequent sequences. In this paper, we propose a new
I/O-efficient algorithm, FFS. Experimental results show that FFS saves
I/O cost significantly compared with AprioriAll. The I/O saving is
obtained at the cost of a mild overhead in CPU cost.

Keywords: data mining, sequence, AprioriAll, FFS 



1 Introduction 

Data mining has recently attracted considerable attention from database practi- 
tioners and researchers because of its applicability in many areas such as decision 
support, market strategy and financial forecasts. Combining techniques from the 
fields of machine learning, statistics and databases, data mining enables us to 
find out useful and invaluable information from huge databases. One of the many 
data mining problems is the extraction of frequent sequences from transactional 
databases. The goal is to discover frequent sequences of events. For example, 
an on-line bookstore may find that most customers who have purchased the 
book “The Gunslinger” are likely to come back again in the future to buy “The 
Gunslinger II” in another transaction. Knowledge of this sort enables the store 
manager to conduct promotional activities and to come up with good marketing 
strategies. 

The problem of mining frequent sequences was first introduced by Agrawal
and Srikant [1]. In their model, a database is a collection of transactions. Each
transaction is a set of items (or an itemset) and is associated with a customer ID
and a time ID. If one groups the transactions by their customer IDs, and then
sorts the transactions of each group by their time IDs in increasing order, the
database is transformed into a number of customer sequences. Each customer
sequence shows the order of transactions a customer has conducted. Roughly
speaking, the problem of mining frequent sequences is to discover “subsequences”
(of itemsets) that occur frequently enough among all the customer sequences.
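
The grouping and sorting step that turns transactions into customer sequences can be sketched as below. The tuple layout (customer_id, time_id, items) and the function name to_customer_sequences are assumptions made for illustration only.

```python
from collections import defaultdict


def to_customer_sequences(transactions):
    """Group transactions by customer and order each group by time.

    transactions: iterable of (customer_id, time_id, items) tuples.
    Returns a dict mapping customer_id -> list of itemsets (frozensets) in time order.
    """
    by_customer = defaultdict(list)
    for customer_id, time_id, items in transactions:
        by_customer[customer_id].append((time_id, frozenset(items)))
    return {
        customer_id: [itemset for _, itemset in sorted(rows, key=lambda row: row[0])]
        for customer_id, rows in by_customer.items()
    }


if __name__ == "__main__":
    log = [
        (2, 5, {"notebook"}),
        (1, 20, {"The Gunslinger II"}),
        (1, 10, {"The Gunslinger", "pen"}),
    ]
    for customer, sequence in to_customer_sequences(log).items():
        print(customer, [sorted(itemset) for itemset in sequence])
```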

In [1], an algorithm, AprioriAll, was proposed to solve the problem of mining
frequent sequences. AprioriAll is a multi-phase iterative algorithm. It scans
the database a number of times. Very similar in structure to the Apriori
algorithm |0j for mining association rules, AprioriAll starts by finding all
frequent 1-sequences¹ from the database. A set of candidate 2-sequences is then
generated. The frequencies of the candidate sequences are then counted by
scanning the database once. The frequent 2-sequences are then used to generate
candidate 3-sequences, and so on. In general, AprioriAll uses a function
called apriori-generate to generate candidate (k+1)-sequences given the set
of all frequent k-sequences. The algorithm terminates when no more frequent
sequences are discovered during a database scan.

It is not difficult to see that the number of database scans required by 
AprioriAll is determined by the length of the longest frequent sequences in 
the database. If the database is huge and if it contains very long frequent se- 
quences, the I/O cost of AprioriAll is high. 

The goal of this paper is to analyze and to improve the I/O requirement
of the AprioriAll algorithm. We propose a new candidate generation function,
FGen. Unlike apriori-generate, which only generates (k+1)-sequences given
the set of frequent k-sequences, FGen generates candidate sequences of various
lengths when provided with a set of frequent sequences of various lengths. Our
strategy for an I/O-efficient algorithm (called FFS) goes as follows. First, we
apply AprioriAll on a small sample of the database to obtain an estimate of
the set of frequent sequences (L). Next, we scan the database to (i) discover the
set of all frequent 1-sequences, and (ii) verify which sequences in the estimate
L are frequent. L is then updated to contain the resulting frequent sequences
(length 1 and above). After that, FGen is applied to L to obtain a candidate
sequence set. We then scan the database to determine which candidate sequences
are frequent. The result is used to update L. We repeat the above procedure of
candidate generation and verification until no new frequent sequences are discovered.

We remark that the initial estimate of the set of frequent sequences (L) can 
be obtained in many different ways. For example, if the database is periodically 
updated, and frequent sequences are mined regularly, the result of a previous 
mining exercise can well be used as L of the next mining exercise. In such a case 
of incremental update, FFS does not even require the sampling phase. 

In later sections, we will prove that FFS is correct and that the set of candi- 
dates generated by FGen is a subset of those generated by apriori-generate. 
We will show that, in many cases, the I/O cost of FFS is significantly less than 
that of AprioriAll. We will show how the performance gain FFS achieves de- 
pends on the accuracy of the initial estimate, L. As an extreme case, FFS requires 
only one or two database scans if L covers all frequent sequences of the database. 
This number is independent of the length of the longest frequent sequences. For 
a database containing long frequent sequences, the I/O saving is significant. 

¹ A k-sequence is a sequence of exactly k itemsets.

The rest of this paper is organized as follows. In Section 2 we review some
related work. In Section 3 we give a formal definition of the problem of mining
frequent sequences. In Section 4 we briefly review the AprioriAll algorithm.
Section 5 presents the FFS algorithm and the FGen function. Experimental results
comparing the performance of FFS and AprioriAll are shown in Section 6.
Finally, we conclude the paper in Section 7.

2 Related Work 

Agrawal and Srikant [1] first studied the problem of mining frequent sequences.
Among the three algorithms they proposed, AprioriAll has the best overall
performance. As we have discussed, the I/O cost of AprioriAll depends on the
length of the longest frequent sequences. AprioriAll would not be very efficient
if the database contains long frequent sequences. We will give a more detailed
discussion of AprioriAll in Section 4.

In ^U|, a faster version of AprioriAll called GSP was proposed. GSP shares 
a structure similar to that of AprioriAll in that it works iteratively. A performance
study shows that GSP is much faster than AprioriAll. GSP can also be used to
solve other generalized versions of the frequent-sequence mining problem. For
example, a user can specify a sliding time window. Items that occur in
transactions within the sliding time window can be considered to occur in
the same transaction. Also, the problem of mining multi-level frequent sequences
is addressed. Although not shown in this paper, our approach of improving the
I/O efficiency of AprioriAll can also be applied to GSP. Due to space limitations,
that modification to GSP is not explicitly discussed in this paper.

A very interesting I/O-efficient algorithm, SPADE, was proposed by Zaki [HJ ■ 
SPADE works on a “vertical” representation of the database, and it only needs 
three database scans to discover frequent sequences. While SPADE is an efficient
algorithm, it requires the availability of the “vertical” database. ISM jSj, an 
algorithm for incremental sequence mining, is based on SPADE; it also provides 
some kind of interactivity. 

In g] , Garofalakis et al. proposed the use of regular expressions as a tool for 
end-users to specify the kinds of frequent sequences that a system should return. 
Algorithms are proposed to mine frequent sequences with regular expression 
constraints. 

In 0, Chen et al. studied the problem of mining path traversal patterns for 
the World Wide Web. The goal is to discover the frequently occurring patterns 
of Web page visits. We can consider the path-traversal-pattern mining problem
a special case of the frequent-sequence mining problem in which each
“transaction” contains a lone “item” (a page visit).

3 Problem Definition 

In this section, we give a formal problem statement of mining frequent sequences. 
We also define some notations to simplify our discussion. 
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals called items. An itemset $X$ is a
set of items (hence, $X \subseteq I$). A sequence $s = \langle t_1, t_2, \ldots, t_n \rangle$ is an ordered set of
itemsets. The length of $s$ (represented by $|s|$) is defined as the number of itemsets
contained in $s$. A sequence of length $k$ is called a $k$-sequence.

Consider two sequences $s_1 = \langle a_1, a_2, \ldots, a_m \rangle$ and $s_2 = \langle b_1, b_2, \ldots, b_n \rangle$. We
say that $s_1$ contains $s_2$ if there exist integers $j_1, j_2, \ldots, j_n$ such that
$1 \le j_1 < j_2 < \ldots < j_n \le m$ and $b_1 \subseteq a_{j_1}, b_2 \subseteq a_{j_2}, \ldots, b_n \subseteq a_{j_n}$.
We represent this relationship by $s_2 \sqsubseteq s_1$.

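As an example, the sequence $s_2 = \langle \{a\}, \{b, c\}, \{d\} \rangle$ is contained in
$s_1 = \langle \{e\}, \{a, d\}, \{g\}, \{b, c, f\}, \{d\} \rangle$ because $\{a\} \subseteq \{a, d\}$, $\{b, c\} \subseteq \{b, c, f\}$,
and $\{d\} \subseteq \{d\}$. Hence, $s_2 \sqsubseteq s_1$. On the other hand, $s_2 \not\sqsubseteq s_3 = \langle \{a, d\}, \{d\}, \{b, c, f\} \rangle$
because $\{b, c\}$ occurs before $\{d\}$ in $s_2$, which is not the case in $s_3$.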
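
The containment test can be written as a single left-to-right pass. The sketch below is illustrative only: sequences are assumed to be Python lists of sets, and the function name contains is hypothetical. Matching each itemset of the shorter sequence at the earliest possible position is sufficient, because an earlier match never rules out a later one.

```python
def contains(s1, s2):
    """True if sequence s1 contains s2, i.e. the itemsets of s2 can be mapped,
    in order, to distinct itemsets of s1 that are supersets of them."""
    matched = 0  # number of itemsets of s2 matched so far
    for itemset in s1:
        if matched < len(s2) and s2[matched] <= itemset:  # `<=` is the subset test
            matched += 1
    return matched == len(s2)


if __name__ == "__main__":
    s1 = [{"e"}, {"a", "d"}, {"g"}, {"b", "c", "f"}, {"d"}]
    s2 = [{"a"}, {"b", "c"}, {"d"}]
    s3 = [{"a", "d"}, {"d"}, {"b", "c", "f"}]
    print(contains(s1, s2))  # True: the example from the text
    print(contains(s3, s2))  # False: {d} does not follow {b, c} in s3
```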

Given a sequence set $V$ and a sequence $s$, if there exists a sequence $s' \in V$
such that $s \sqsubseteq s'$, we write $s \vdash V$. In words, we say $s$ is contained in some
sequence of $V$.

Given a sequence set $V$, a sequence $s \in V$ is maximal if $s$ is not contained in
any other sequence in $V$ except $s$ itself.

A database $D$ consists of a number of sequences. The support of a sequence $s$
is defined as the fraction of all sequences in $D$ that contain $s$. We use $sup(s)$
to denote the support of $s$. If the support of $s$ is no less than a user-specified
support threshold $\rho_s$, $s$ is a frequent sequence. The problem of mining frequent
sequences is to find all maximal frequent sequences given a sequence database $D$.

We use the symbol $L_i$ to denote the set of all length-$i$ frequent sequences.
Also, we use $L$ to denote the set of all frequent sequences. That is, $L = \bigcup_{i=1}^{\infty} L_i$.
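
The definitions of support and maximality translate directly into code. The following is a minimal sketch under stated assumptions: sequences are tuples of frozensets, the database is a list of such tuples, and the names contains, support and maximal are hypothetical.

```python
def contains(big, small):
    """True if `small` is contained in `big` (order-preserving, subset per itemset)."""
    matched = 0
    for itemset in big:
        if matched < len(small) and small[matched] <= itemset:
            matched += 1
    return matched == len(small)


def support(database, s):
    """Fraction of the sequences in the database that contain s."""
    return sum(contains(customer, s) for customer in database) / len(database)


def maximal(sequences):
    """Sequences that are not contained in any other sequence of the collection."""
    seqs = list(sequences)
    return [s for s in seqs if not any(t != s and contains(t, s) for t in seqs)]


def fs(*items):
    """Small helper to write itemsets compactly."""
    return frozenset(items)


if __name__ == "__main__":
    db = [(fs("a"), fs("b")), (fs("a"), fs("b", "c")), (fs("b"),), (fs("a"),)]
    print(support(db, (fs("a"), fs("b"))))  # 0.5: two of the four sequences contain it
    print(maximal([(fs("a"),), (fs("b"),), (fs("a"), fs("b"))]))
```

With a support threshold of 0.5, the sequence ⟨{a}, {b}⟩ in this toy database would just qualify as frequent, since the definition requires the support to be no less than the threshold.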

4 AprioriAll 

In this section we review the AprioriAll algorithm. AprioriAll solves the 
problem of mining frequent sequences in the following five phases. 

1. Sort Phase. This phase transforms a transaction database to a sequence 
database by sorting the database with customer-ID as the major key and 
transaction time as a minor key. 

2. Litemset (Large Itemset) Phase. In this phase, the set of frequent itemsets
LIT is found using the Apriori algorithm. The support of an itemset X is
defined as the fraction of sequences in the database which contain a
transaction T such that $X \subseteq T$.² By definition, the set of frequent 1-sequences is
simply $\{\langle t \rangle \mid t \in LIT\}$. Hence, all frequent 1-sequences are found in the
Litemset Phase.

² Note that this definition of support is slightly different from that of the traditional
association rule mining model.

3. Transformation Phase. In this phase, every frequent itemset found in the
Litemset Phase is mapped to a unique integer. Each transaction in the
database is then replaced by the set of all frequent itemsets (identified by
their unique integers) that are contained in that transaction. If a transaction
does not contain any frequent itemset, the transaction is simply thrown
away.

4. Sequence Phase. All frequent sequences are found in this phase. Similar
to Apriori, AprioriAll is an iterative algorithm. It uses a function
apriori-generate to generate candidate sequences, given a set of frequent
sequences. Candidates generated in an iteration are of the same length. In
general, during the $i$-th iteration, apriori-generate is applied to the set $L_i$
to generate candidate sequences of length $i + 1$. The database is scanned to
count the supports of the candidates. The frequent ones are put into the set
$L_{i+1}$. The iteration terminates when no new frequent sequences are found.
To generate candidate sequences in the $i$-th iteration, apriori-generate
considers every pair of frequent sequences $s_1$ and $s_2$ from $L_i$ such that:

$s_1 = \langle X_1, X_2, \ldots, X_{i-1}, A \rangle$, $s_2 = \langle X_1, X_2, \ldots, X_{i-1}, B \rangle$,

where $X_1, \ldots, X_{i-1}, A, B$ are all itemsets. That is, the first $i - 1$ itemsets in
$s_1$ and $s_2$ are exactly the same. Two new sequences:

$m_1 = \langle X_1, X_2, \ldots, X_{i-1}, A, B \rangle$ and $m_2 = \langle X_1, X_2, \ldots, X_{i-1}, B, A \rangle$

are generated. AprioriAll then checks whether all length-$i$ subsequences³
of $m_1$ are in $L_i$. If so, $m_1$ is put into the candidate set. The sequence $m_2$ is
similarly checked. Note that if the length of the longest frequent sequence is
$n$, AprioriAll would scan the database at least $n - 1$ times in this phase.

5. Maximal Phase. Frequent sequences which are not maximal (i.e., they are 
contained by some other frequent sequences) are deleted in this phase. 
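
The candidate generation of the Sequence Phase can be sketched as follows. This is an illustration rather than the authors' implementation. It assumes that, after the Transformation Phase, each frequent itemset is an atomic integer id, so a frequent sequence is a tuple of ids, and that a sequence may be paired with itself (A = B). The function name apriori_generate is a hypothetical Python spelling of apriori-generate.

```python
from collections import defaultdict
from itertools import combinations, product


def apriori_generate(frequent_i):
    """Generate candidate (i+1)-sequences from the set of frequent i-sequences.

    frequent_i: iterable of equal-length tuples of litemset ids.
    """
    frequent_i = set(frequent_i)
    # Group the last elements by the shared prefix of length i-1.
    last_by_prefix = defaultdict(list)
    for seq in frequent_i:
        last_by_prefix[seq[:-1]].append(seq[-1])

    candidates = set()
    for prefix, lasts in last_by_prefix.items():
        # Ordered pairs (a, b) yield both m1 = prefix+(A, B) and m2 = prefix+(B, A).
        for a, b in product(lasts, repeat=2):
            m = prefix + (a, b)
            # Keep m only if every subsequence obtained by deleting one element
            # is itself a known frequent i-sequence.
            if all(sub in frequent_i for sub in combinations(m, len(m) - 1)):
                candidates.add(m)
    return candidates


if __name__ == "__main__":
    l2 = {(1, 2), (1, 3), (2, 3)}
    print(apriori_generate(l2))  # {(1, 2, 3)}
```

Applied to a set of 1-sequences, the same function simply returns every ordered pair, which matches the way length-2 candidates are formed.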

5 FFS and FGen 

Among the five phases of AprioriAll, the Litemset Phase (for finding frequent 
itemsets) and the Sequence Phase (for finding frequent sequences) are the most 
I/O intensive. To reduce the I/O cost, one needs to find efficient algorithms for 
these two phases. Algorithms like DIC 0, Pincer-Search 0, and FlipFlop [Z]
are examples of I/O-efficient algorithms for finding frequent itemsets. They can be
used to improve the efficiency of the Litemset Phase. In this paper, we focus on 
improving the Sequence Phase. This section introduces our algorithm FFS and 
its candidate generating function FGen. 

To reduce the I/O cost of the Sequence Phase, FFS first finds a suggested
frequent sequence set, or an estimate. We denote this set by L. If the database is
regularly updated and frequent sequences are mined periodically, then the
result obtained from a previous mining exercise can be used as the estimate. If
such an estimate is not readily available, we could mine a small sample (say,
10%) of the database to obtain L.

³ A subsequence of a given sequence s is a sequence obtained by deleting one or more
itemsets from s.

As has been explained in the previous section, all length-1 frequent sequences
(i.e., the set L1) are already found during the Litemset Phase. FFS then
concatenates every possible pair of sequences from L1 to form a set of length-2
candidates. (Essentially, we are applying apriori-generate to L1 just like
AprioriAll does.) The database is then scanned to verify which length-2
candidate sequences are frequent as well as which sequences in L are frequent. The set
L is then updated to contain only those frequent sequences found. After that, FFS
iterates the following two steps: (1) apply FGen on L to obtain a candidate set C;
(2) scan the database to find out which sequences in C are frequent. Those
frequent candidate sequences are added to L. This successive refinement procedure
stops when FFS cannot find any new frequent sequences in an iteration.

Figure 1 shows the algorithm FFS. The algorithm takes as inputs a transformed
database (ConvertedD) obtained from the Transformation Phase, the set
of length-1 frequent sequences (L1) obtained from the Litemset Phase, and a
suggested set of frequent sequences (Sest) obtained perhaps by applying AprioriAll
on a database sample. FFS maintains a set MFSS which contains all and only
the maximal sequences of the set of frequent sequences known at any instant.
Since all subsequences of a frequent sequence are frequent, MFSS is sufficient to
represent the set of all frequent sequences known.



 1  Algorithm FFS(ConvertedD, ρs, L1, Sest)
 2    MFSS := L1
 3    CandidateSet := {⟨t1, t2⟩ | ⟨t1⟩, ⟨t2⟩ ∈ L1} ∪ {s | s ⊢ Sest, |s| > 2}
 4    Scan ConvertedD to get support of every sequence in CandidateSet
 5    NewFrequentSequences := {s | s ∈ CandidateSet, sup(s) >= ρs}
 6    AlreadyCounted := {s | s ⊢ Sest, |s| > 2}
 7    Iteration := 3
 8    while (NewFrequentSequences ≠ ∅)
 9      // Max(S) returns the set of all maximal sequences in S
10      MFSS := Max(MFSS ∪ NewFrequentSequences)
11      CandidateSet := FGen(MFSS, Iteration, AlreadyCounted)
12      Scan ConvertedD to get support of every sequence in CandidateSet
13      NewFrequentSequences := {s | s ∈ CandidateSet, sup(s) >= ρs}
14      Iteration := Iteration + 1
15    Return MFSS

Fig. 1. Algorithm FFS



The most important component of FFS is the FGen function (see Figure 2).
The function takes three parameters: MFSS, the set of all maximal
frequent sequences known so far; Iteration, a loop counter that FFS maintains;
and AlreadyCounted, a set of sequences whose supports have already been
counted.

FGen generates candidate sequences given a set of frequent sequences by
“joining” MFSS with itself (lines 3-6). A candidate sequence m is removed (from
the candidate set) if any one of the following conditions is true (lines 7-11):

- m is contained in some sequence already known to be frequent. (Since m
must then be frequent, we do not need to count its support.)

- m’s support has already been counted.

- some of m’s subsequences are not known to be frequent.
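
The FFS loop and the join-and-prune behaviour of FGen described above can be sketched in a few dozen lines. This is a simplified illustration under explicit assumptions, not the authors' implementation: every transaction is assumed to have been transformed into a single atomic litemset id, so customer sequences and candidate sequences are tuples of integers; the common subsequence used in the join is assumed to have length at least Iteration-2; subsequences are enumerated exhaustively, which is only feasible for toy inputs; and all Python names (fgen, ffs, covered, and so on) are hypothetical.

```python
from itertools import combinations


def contains(big, small):
    """True if `small` is an order-preserving subsequence of `big`."""
    rest = iter(big)
    return all(x in rest for x in small)


def covered(s, seqs):
    """s is contained in some member of seqs (written s ⊢ seqs in the text)."""
    return any(contains(m, s) for m in seqs)


def subsequences(s, min_len):
    return {c for n in range(min_len, len(s) + 1) for c in combinations(s, n)}


def maximal(seqs):
    return {s for s in seqs if not any(t != s and contains(t, s) for t in seqs)}


def support(db, s):
    return sum(contains(customer, s) for customer in db) / len(db)


def fgen(mfss, iteration, already_counted):
    k = iteration - 2
    pool = [s for s in mfss if len(s) > k]
    candidates = set()
    # Join: for every ordered pair (s1, s2), glue head + common + tail.
    for s1 in pool:
        for s2 in pool:
            for common in subsequences(s1, k):
                if not contains(s2, common):
                    continue
                heads = [t for t in set(s1) if contains(s1, (t,) + common)]
                tails = [t for t in set(s2) if contains(s2, common + (t,))]
                candidates.update((h,) + common + (t,) for h in heads for t in tails)
    # Prune: known frequent, already counted, or with an unverified subsequence.
    kept = {
        m for m in candidates
        if not covered(m, mfss)
        and m not in already_counted
        and all(covered(sub, mfss) for sub in combinations(m, len(m) - 1))
    }
    already_counted |= kept
    already_counted -= {s for s in already_counted if len(s) == iteration}
    return kept


def ffs(db, min_sup, l1, s_est):
    """Return (maximal frequent sequences, number of database scans used)."""
    mfss = set(l1)
    est_subs = {c for s in s_est for c in subsequences(s, 3)}
    items = [s[0] for s in l1]
    candidates = {(a, b) for a in items for b in items} | est_subs
    scans = 1
    new_frequent = {s for s in candidates if support(db, s) >= min_sup}
    already_counted = set(est_subs)
    iteration = 3
    while new_frequent:
        mfss = maximal(mfss | new_frequent)
        candidates = fgen(mfss, iteration, already_counted)
        if candidates:
            scans += 1  # a scan is needed only if something must be counted
        new_frequent = {s for s in candidates if support(db, s) >= min_sup}
        iteration += 1
    return mfss, scans


if __name__ == "__main__":
    db = [(1, 2, 3, 4), (1, 2, 3, 4), (1, 2, 3), (2, 4)]
    l1 = {(1,), (2,), (3,), (4,)}
    print(ffs(db, 0.5, l1, []))              # no estimate available
    print(ffs(db, 0.5, l1, [(1, 2, 3, 4)]))  # estimate covers everything
```

On this toy database both calls return the single maximal frequent sequence (1, 2, 3, 4); without an estimate the sketch counts three scans, while with the full estimate a single scan suffices because every generated candidate is already covered.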



 1  Function FGen(MFSS, Iteration, AlreadyCounted)
 2    CandidateSet := ∅
 3    for each pair of s1, s2 ∈ MFSS such that |s1| > Iteration-2, |s2| > Iteration-2
        and that s1, s2 share at least one common subsequence of length ≥ Iteration-2
 4      for each common subsequence s of s1, s2 such that |s| ≥ Iteration-2
 5        NewCandidate := {⟨t1, s, t2⟩ | (⟨t1, s⟩ ⊑ s1, ⟨s, t2⟩ ⊑ s2)
                                        or (⟨t1, s⟩ ⊑ s2, ⟨s, t2⟩ ⊑ s1)}
 6        CandidateSet := CandidateSet ∪ NewCandidate
 7    for each sequence s ∈ CandidateSet
 8      if (s ⊢ MFSS) delete s from CandidateSet
 9      if s ∈ AlreadyCounted delete s from CandidateSet
10      for any subsequence s' of s with length |s| − 1
11        if (s' ⊬ MFSS) delete s from CandidateSet
12    AlreadyCounted := AlreadyCounted ∪ CandidateSet
13    for each sequence s ∈ AlreadyCounted
14      if (|s| = Iteration) delete s from AlreadyCounted
15    Return CandidateSet

Fig. 2. Function FGen

5.1 Theorems

In this subsection, we summarize a few properties of FFS and FGen in the
following theorems. Due to space limitations, we refer the readers to m for the
proofs. For convenience, we use the symbol L to represent the set of all frequent
sequences found by AprioriAll, C_AprioriAll to represent the set of all candidate
sequences generated (and whose supports are counted) by apriori-generate,
and C_FGen to represent the set of all candidate sequences generated (and whose
supports are counted) by FGen. Also, if X is a sequence set, we use Max(X) to
represent the set of all maximal sequences in X.

Theorem 1. When FFS terminates, MFSS = Max(L).

Since Max(L) is the set of maximal frequent sequences found by AprioriAll
and MFSS is the set of maximal frequent sequences found by FFS, Theorem 1
says that AprioriAll and FFS discover the same set of maximal frequent
sequences. Therefore, if AprioriAll finds all maximal frequent sequences in
the database, so does FFS. Hence, FFS is correct.

Theorem 2. C_FGen ⊆ C_AprioriAll.

Theorem 2 says that the set of candidates generated by FGen is a subset of
that generated by AprioriAll. Thus FGen does not generate any unnecessary
candidates, and no resources are wasted on counting their supports.

6 Performance

We performed a number of experiments comparing the performance of 
AprioriAll and FFS. Our goals are to study how much I/O cost FFS could 
save, and how effective sampling is in discovering an initial estimate of the set 
of frequent sequences required by FFS. In this section, we present some repre- 
sentative results from our experiments. 

We used synthetic data as the test databases. The generator is obtained from 
the IBM Quest data mining project. Readers are referred to jS| for the details of 
the data generator. The values of the parameters we used in the data generation 
are listed in Table 1.



Table 1. Parameters and values for data generation

Parameter  Description                                                      Value
|D|        Number of customers (size of Database)                           100,000
|C|        Average number of transactions per Customer                      10
|T|        Average number of items per Transaction                          5
|S|        Average length of maximal potentially large Sequences            4
|I|        Average size of Itemsets in maximal potentially large Sequences  1.25
N_S        Number of maximal potentially large Sequences                    5,000
N_I        Number of maximal potentially large Itemsets                     25,000
N          Number of items                                                  10,000



6.1 Coverages and I/O Savings 

Recall that FFS requires a suggested frequent sequence set, Sest. In this
subsection, we study how the “coverage” of Sest affects the performance of FFS. By
coverage, we mean the fraction of the real frequent sequences that are contained
in Sest. It is defined as

$$\text{coverage} = \frac{\left| \{ s \mid s \vdash S_{est} \} \cap \left( \bigcup_{i=3}^{\infty} L_i \right) \right|}{\left| \bigcup_{i=3}^{\infty} L_i \right|}$$

where $L_i$ represents the set of all frequent sequences of length $i$. For our definition
of coverage, we only consider those frequent sequences that are of length 3 or
longer. This is because the set L1 is already discovered in the Litemset Phase
and all length-2 candidate sequences will be checked by FFS during its first
scan of the database (see Figure 1, line 3). Therefore, whether Sest contains
frequent 1-sequences or frequent 2-sequences is immaterial, and the number of
I/O passes required by FFS is not affected.
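
The coverage definition can be made concrete with a short sketch. It assumes, for illustration only, that sequences are tuples of atomic litemset ids, so that the sequences contained in some member of Sest are exactly the subsequences of its members; the names subsequences and coverage are hypothetical.

```python
from itertools import combinations


def subsequences(s, min_len):
    """All subsequences of s with at least min_len elements."""
    return {c for n in range(min_len, len(s) + 1) for c in combinations(s, n)}


def coverage(s_est, frequent):
    """Fraction of the frequent sequences of length >= 3 that are contained
    in some sequence of the estimate s_est."""
    long_frequent = {s for s in frequent if len(s) >= 3}
    if not long_frequent:
        raise ValueError("coverage is undefined without frequent sequences of length >= 3")
    covered = {c for s in s_est for c in subsequences(s, 3)}
    return len(covered & long_frequent) / len(long_frequent)


if __name__ == "__main__":
    frequent = {(1, 2), (1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (1, 2, 3, 4)}
    print(coverage([(1, 2, 3)], frequent))     # 0.2: one of five long sequences
    print(coverage([(1, 2, 3, 4)], frequent))  # 1.0: every long sequence is covered
```

The second call shows why a single long sequence in the estimate raises the coverage sharply: it brings all of its subsequences with it.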

In our first experiment, we generated a database using the parameter values
listed in Table 1. We then applied AprioriAll on the database to obtain L, the
set of all frequent sequences. We then randomly selected a subset of sequences
from L, forming an estimated set Sest. FFS was then applied on the database
using Sest. The numbers of I/O passes used by AprioriAll and FFS in the
Sequence Phase were then compared. We repeated the experiment using different
support thresholds, ρs. The results of the experiments (for ρs = 0.75%, 0.5%,
and 0.4%) are shown in Figure 3.







Fig. 3. Number of I/O passes vs. coverage under different support thresholds 



In Figure 3, we use three kinds of points (‘O’, ‘+’, ‘D’) to represent the three
support thresholds 0.75%, 0.50%, and 0.40%, respectively. For example, point A is
represented by O and its coordinate is (0.83, 4). This means that when the
support threshold was 0.75% and the coverage of Sest was 0.83, FFS took 4 I/O
passes. We see that when coverage increases, the number of I/O passes required
by FFS decreases.

We observe that the curves follow a general trend: 

- When coverage = 0 (e.g., when Sest is empty), FFS degenerates to
AprioriAll. They thus require the same number of I/O passes.

- When coverage is small, Sest contains very few long frequent sequences.
This is because if Sest covered a long frequent sequence s, it would also cover every
subsequence of s. These subsequences are frequent and, if s is long, they are
numerous, so the coverage of Sest would be high. Since few long frequent
sequences are covered by Sest, quite a number of I/O passes are required to
discover them. Hence, with a small coverage, FFS does not reduce the I/O
cost at all.

- When coverage is moderate, FFS becomes more effective. The amount of I/O
saving increases with the coverage.

- When coverage is 100% (i.e., Sest covers all frequent sequences in the
database), FFS requires only two passes over the database: one pass to verify
that the sequences in Sest are all frequent, another pass to verify that no
more frequent sequences can be found.

6.2 Sampling

One way to obtain the set Sest is to apply AprioriAll on a database sample. In
the next set of experiments, we study this sampling approach. We first applied
AprioriAll on our synthetic dataset to obtain the number of I/O passes it
required in the Sequence Phase. Then a random sample of the database was
drawn, on which AprioriAll was applied to obtain Sest. We then executed FFS
using the Sest found. This exercise was repeated a number of times, each with a
different random sample. The average amount of I/O cost required by FFS was
noted. This I/O cost includes the cost of mining the sample. For example, if a
1/8 sample was used, AprioriAll scanned the sample database 6 times to get
Sest, and FFS scanned the database 3 times to get all maximal frequent sequences,
then the total I/O cost is calculated as: 1/8 × 6 + 3 = 3.75 passes.

Besides I/O cost, we also compare the number of candidates the two algo- 
rithms counted and the CPU time they took. The number of candidates FFS 
counted is measured by the following formula: 

(# of candidates counted to obtain Sest) × (sample size)
+ (# of candidates counted in FFS).

Similarly, the amount of CPU time FFS took includes the CPU time for 
mining the sample. 
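
The two accounting rules above, for I/O passes and for candidate counts, weight the work done on the sample by the sample size. A minimal sketch follows; the function names are hypothetical, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction


def total_io_passes(sample_fraction, sample_scans, full_scans):
    """I/O cost in units of full database passes, including mining the sample."""
    return sample_fraction * sample_scans + full_scans


def total_candidates(sample_fraction, sample_candidates, ffs_candidates):
    """Candidate count with the sample's candidates weighted by the sample size."""
    return sample_fraction * sample_candidates + ffs_candidates


if __name__ == "__main__":
    # The example from the text: a 1/8 sample scanned 6 times, then 3 full scans.
    passes = total_io_passes(Fraction(1, 8), 6, 3)
    print(float(passes))  # 3.75
```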

The result of the experiment (with different sample sizes) is shown in Table 2.
Note that the average number of candidates counted and the average CPU
cost of FFS are shown in relative quantities (with those of AprioriAll set to 1).



Table 2. Performance of FFS vs. sample size (ρs = 0.75%)

sample size      1/128  1/64   1/32   1/16   1/8    1/4    1/2    0 (AprioriAll)
avg. coverage    0.673  0.703  0.763  0.806  0.855  0.904  0.946  N/A
avg. I/O cost    5.095  4.859  4.541  4.448  4.357  4.691  5.938  6
avg. # of cand.  1.087  1.069  1.078  1.083  1.133  1.254  1.502  1
avg. CPU cost    1.152  1.136  1.138  1.157  1.212  1.355  1.599  1



From Table 2, we see that FFS needed fewer I/O passes than AprioriAll (6
passes). As the sample size increases, the coverage of Sest becomes higher, and
fewer I/O passes are needed for FFS to discover all the frequent sequences given
Sest. This accounts for the drop in I/O cost from 5.095 passes to 4.357 passes
as the sample size increases from 1/128 to 1/8. As the sample size increases
further, however, the I/O cost of mining the sample becomes substantial. The
benefit obtained by having a better-coverage Sest is outweighed by the penalty
of mining the sample. Hence, the overall I/O cost increases as the sample size
increases from 1/8 to 1/2. Also, as the sample size increases, more work is spent
on candidate counting, and the CPU cost increases. For a 1/16 sample, for
example, FFS reduces about (6 − 4.448)/6 = 26% of the I/O cost at the expense of a
16% increment in CPU cost.

We can further improve the performance of FFS by using a slightly smaller
support threshold (ρs_sample) to mine the sample. The idea is that by using
a small ρs_sample, more sequences will be included in Sest. This will potentially
improve the coverage of Sest, and hence a larger reduction in I/O cost is achieved.
The disadvantage of this trick, however, is a larger CPU cost, since more support
counting would have to be done to verify which sequences in Sest are frequent.

Table 3 shows the performance of FFS using different values of ρs_sample. In
this set of experiments, the sample size is fixed at 1/16 and ρs is 0.75%.



Table 3. Performance of FFS with different values of ρs_sample (ρs = 0.75%, sample
size = 1/16)

ρs_sample        0.750%  0.675%  0.600%  0.525%  0.450%  0.375%  AprioriAll
avg. coverage    0.806   0.939   0.986   0.998   1.000   1.000   N/A
avg. I/O cost    4.448   3.731   3.215   2.544   1.913   1.450   6
avg. # of cand.  1.083   1.127   1.183   1.255   1.304   1.322   1
avg. CPU cost    1.157   1.149   1.150   1.162   1.175   1.173   1



From Table 3 we see that when $\rho_{s\_sample}$ decreases, the I/O cost of FFS 
decreases dramatically, while the CPU cost increases only very mildly. For 
example, when $\rho_{s\_sample}$ = 0.450%, the average coverage reaches 100%, and the 
average I/O cost is 1.913. Compared with $\rho_{s\_sample}$ = 0.750% (i.e., $\rho_{s\_sample} = \rho_s$), 
we save an additional (4.448 − 1.913)/6 = 42% of the I/O cost at the expense of 
an additional (1.175 − 1.157)/1 = 1.8% increment in CPU cost. 
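
The two percentages above follow directly from the entries of Table 3. A minimal sketch that recomputes them, assuming (as the table's AprioriAll column suggests) that costs are expressed relative to AprioriAll's 6 I/O passes and unit CPU cost; the dictionary layout and function name are illustrative only:

```python
# Values copied from Table 3; AprioriAll is the baseline (6 I/O passes, CPU cost 1).
APRIORI_IO, APRIORI_CPU = 6.0, 1.0
table3 = {  # sample support threshold (%) -> (avg. I/O cost, avg. CPU cost)
    0.750: (4.448, 1.157),
    0.450: (1.913, 1.175),
}

def extra_saving(base, tuned):
    """Additional I/O saving and CPU increase of `tuned` over `base`,
    both expressed as fractions of the AprioriAll baseline."""
    io_base, cpu_base = table3[base]
    io_tuned, cpu_tuned = table3[tuned]
    return (io_base - io_tuned) / APRIORI_IO, (cpu_tuned - cpu_base) / APRIORI_CPU

io_gain, cpu_loss = extra_saving(0.750, 0.450)
print(f"extra I/O saving: {io_gain:.0%}, extra CPU cost: {cpu_loss:.1%}")
```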

Notice that the 1.913 passes of I/O already include the I/O cost of mining 
the 1/16 sample. We looked into our experimental results and found that, 
in some instances, FFS took only 1 pass to find the frequent 
sequences given an $S_{est}$. With a small $\rho_{s\_sample}$, $S_{est}$ includes many infrequent 
sequences. During the first iteration of FFS (after sampling), the supports of 
these infrequent sequences are also counted. During the second iteration of FFS, 
candidate sequences are generated. However, in those instances, all the candidate 
sequences were in the set $S_{est}$. Since their supports were already counted in the 
first iteration, no database scan is needed. FFS thus terminated with only 1 pass 
over the database (plus the I/O needed to mine the sample). This is why the 
average I/O cost for $\rho_{s\_sample}$ = 0.45% is less than 2. 

7 Conclusion 

In this paper, we proposed a new I/O efficient algorithm FFS to solve the problem 
of mining frequent sequences. A new candidate generation method FGen was 
proposed which can generate candidate sequences of multiple lengths given a 
set of suggested frequent sequences. We performed experiments to compare the 
performance of FFS and AprioriAll. We showed that FFS saves I/O passes 
significantly, especially when an estimate ($S_{est}$) of the set of frequent sequences 
with a good coverage is available. We showed how mining a small sample of the 
database leads to a good $S_{est}$. By using a smaller support threshold ($\rho_{s\_sample}$) 
in mining the sample, we showed that FFS outperforms AprioriAll by a wide 
margin. The I/O saving is obtained, however, at a mild CPU cost. 






References 

1. Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Proc. 
of the 11th Int'l Conference on Data Engineering, Taipei, Taiwan, March 1995. 

2. Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic 
itemset counting and implication rules for market basket data. In Proceedings of 
the ACM SIGMOD Conference on Management of Data, 1997. 

3. Ming-Syan Chen, Jong Soo Park, and Philip S. Yu. Efficient data mining for path 
traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2), 
March/April 1998. 

4. Minos N. Garofalakis, Rajeev Rastogi, and Kyuseok Shim. SPIRIT: Sequential 
pattern mining with regular expression constraints. In Proceedings of the 25th 
International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, 
September 1999. 

5. http://www.almaden.ibm.com/cs/quest/. 

6. Dao-I Lin and Zvi M. Kedem. Pincer-search: A new algorithm for discovering 
the maximum frequent set. In Proceedings of the 6th Conference on Extending 
Database Technology (EDBT), Valencia, Spain, March 1998. 

7. K.K. Loo, C.L. Yip, Ben Kao, and David Cheung. Exploiting the duality of 
maximal frequent itemsets and minimal infrequent itemsets for I/O efficient association 
rule mining. In Proc. of the 11th International Conference on Database and Expert 
Systems Applications, London, Sept. 2000. 

8. S. Parthasarathy, M. J. Zaki, M. Ogihara, and S. Dwarkadas. Incremental and 
interactive sequence mining. In Proceedings of the 1999 ACM 8th International 
Conference on Information and Knowledge Management (CIKM'99), Kansas City, 
MO, USA, November 1999. 

9. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of 
items in large databases. In Proc. ACM SIGMOD International Conference on 
Management of Data, page 207, Washington, D.C., May 1993. 

10. Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: 
Generalizations and performance improvements. In Proc. of the 5th Conference on 
Extending Database Technology (EDBT), Avignon, France, March 1996. 

11. Mohammed J. Zaki. Efficient enumeration of frequent sequences. In Proceedings 
of the 1998 ACM 7th International Conference on Information and Knowledge 
Management (CIKM'98), Washington, United States, November 1998. 

12. Minghua Zhang, Ben Kao, C.L. Yip, and David Cheung. FFS - An I/O efficient 
algorithm for mining frequent sequences. CS Technical Report 
TR-2000-6, University of Hong Kong, 2000. 




Sequential Index Structure for Content-Based Retrieval 



Maciej Zakrzewicz 

Poznan University of Technology, Institute of Computing Science 
Piotrowo 3a, 60-965 Poznan, Poland 
mzakrz@cs.put.poznan.pl 



Abstract. Data mining applied to databases of data sequences generates a 
number of sequential patterns, which often require additional processing. The 
post-processing usually consists in searching the source databases for data 
sequences which contain a given sequential pattern or a part of it. This type of 
content-based querying is not well supported by RDBMSs, since the traditional 
optimization techniques are focused on exact-match querying. In this paper, we 
introduce a new bitmap-oriented index structure, which efficiently optimizes 
content-based queries on dense databases of data sequences. Our experiments 
show a significant improvement over traditional database accessing methods. 



1 Introduction 



Mining of sequential patterns consists in identifying trends in databases of data 
sequences [1,6]. A sequential pattern represents a frequently occurring subsequence. 
An example of a sequential pattern that holds in a video rental database is that 
customers typically rent "Star Wars", then "Empire Strikes Back", and then "Return of the 
Jedi". Note that (1) these rentals need not be consecutive, and (2) during a single visit, a 
customer may rent a set of videos instead of a single one. Post-processing of 
discovered sequential patterns usually consists in searching the source databases for 
data sequences containing a given sequential pattern. For example, when we discover 
an interesting sequential pattern in the video rental database, we would probably like 
to find all customers who satisfy (contain) the pattern. We will refer to this type of 
searching as content-based sequence retrieval. 



SID  TS  L
1    1   A
1    1   B
1    2   C
1    3   D
2    1   A
2    2   E
2    2   C
2    3   F

SELECT R1.SID
FROM   R R1, R R2, R R3
WHERE  R1.SID = R2.SID
AND    R2.SID = R3.SID
AND    R1.TS < R2.TS
AND    R2.TS < R3.TS
AND    R1.L = 'A'
AND    R2.L = 'E'
AND    R3.L = 'F';



Fig. 1. The relation of data sequences and the content-based sequence retrieval query 

In most cases, data sequences (and sequential patterns) are stored in relational 
databases. Consider the following example of using the relational approach to 
content-based sequence retrieval. The relation R(SID,TS,L) stores data sequences. Each 
tuple contains the sequence identifier (SID), the timestamp (TS), and the item (L). Our 
relation R describes two data sequences: {A,B}->{C}->{D}, {A}->{E,C}->{F}. Let the 
searched data subsequence be: {A}->{E}->{F}. Figure 1 gives the relation R and the 
SQL query, which implements the content-based sequence retrieval problem. 
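
To make the relational approach concrete, the following sketch loads the relation R of Figure 1 into an in-memory SQLite database and runs the three-way self-join. The use of SQLite, the qualified column `R1.SID`, and the added `DISTINCT` are assumptions made for this illustration; the paper itself targets a generic RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (SID INTEGER, TS INTEGER, L TEXT)")
rows = [
    (1, 1, "A"), (1, 1, "B"), (1, 2, "C"), (1, 3, "D"),  # {A,B}->{C}->{D}
    (2, 1, "A"), (2, 2, "E"), (2, 2, "C"), (2, 3, "F"),  # {A}->{E,C}->{F}
]
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)

# One alias of R per element of the searched subsequence {A}->{E}->{F}.
query = """
SELECT DISTINCT R1.SID
FROM R R1, R R2, R R3
WHERE R1.SID = R2.SID AND R2.SID = R3.SID
  AND R1.TS < R2.TS AND R2.TS < R3.TS
  AND R1.L = 'A' AND R2.L = 'E' AND R3.L = 'F'
"""
matching_sids = [sid for (sid,) in conn.execute(query)]
print(matching_sids)  # → [2]
```

Each additional element of the searched subsequence adds another self-join of R, which is what makes this query expensive on large relations.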

Since mined databases tend to be very large, there is a problem of optimizing the 
database access while performing content-based sequence retrieval, e.g. by means of 
the above SQL query. Database research has developed many indexing techniques, 
like B+ trees [3], bitmapped indexes [8], k-d trees [2], and R-trees [5], to optimize queries 
based on exact matches of single tuples. However, these techniques do not 
significantly improve content-based sequence retrieval queries, which deal with 
partial matches of multi-tuple sequences. There are also proposals for set-based 
indexing [7] [4], which is used to improve subset searching. However, these methods 
work for retrieval of unordered sets of items only. 

To see the shortcomings of the existing indexing methods, let us 
consider applying B+ tree and set-based indexes to execute the query from Figure 1. 
Using a B+ tree index, tuples containing all items of each data sequence are joined 
first (SID attribute), and then checked to see whether they contain the given items in the given 
order. This can be fairly ineffective since a data sequence may span many disk 
blocks, which results in multiple scans of each block of the relation. Using a 
set-based index, the sequence identifiers (SID attribute) of all sequences that contain 
the searched items in any order are found, and then the sequences are read from the 
relation (perhaps with the help of a B+ tree) to verify the ordering of their items. This 
approach gives much better results than a B+ tree index, but a significant 
overhead comes from reading and verifying the sequences having incorrect ordering. 

In this paper we consider content-based retrieval of data sequences from dense 
databases, characterized by a relatively small number of items which occur frequently 
and in various orders (e.g. web logs), so that a set-based index is not efficient. We 
introduce a new bitmap-oriented indexing method, which addresses this problem. The 
basic idea behind our method, as compared to set-based indexes, is that the index 
structure includes not only the items of a sequence, but also the ordering of the items. 

Basic definitions. Let $L = \{l_1, l_2, \ldots, l_m\}$ be a set of literals called items. A data sequence 
$S = \langle X_1 X_2 \ldots X_n \rangle$ is an ordered list of sets of items such that each set of items $X_i \subseteq L$. $X_i$ 
is called a sequence element. All items in a sequence element are unordered. We say 
that a data sequence $\langle X_1 X_2 \ldots X_n \rangle$ is contained in another data sequence $\langle Y_1 Y_2 \ldots Y_m \rangle$ 
if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $X_1 \subseteq Y_{i_1}, X_2 \subseteq Y_{i_2}, \ldots, X_n \subseteq Y_{i_n}$. Let D be a 
database of variable length data sequences. Let S be a data sequence. The problem of 
content-based sequence retrieval consists in finding in D all data sequences which 
contain the data sequence S. 
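
The containment relation defined above can be checked directly, without any index. The sketch below is one straightforward way to do so: it represents a data sequence as a list of Python sets and greedily maps each pattern element to the earliest usable data element. The function names and the dictionary used as the database D are illustrative assumptions, not part of the paper.

```python
def contains(data_seq, pattern):
    """Return True if `pattern` is contained in `data_seq`.

    Both arguments are lists of sets (sequence elements). Each pattern element
    is mapped to the earliest data element that is a superset of it and lies
    strictly after the element used for the previous pattern element.
    """
    pos = 0
    for x in pattern:
        while pos < len(data_seq) and not x <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next pattern element must map to a later data element
    return True

def retrieve(database, pattern):
    """Content-based sequence retrieval by exhaustive scan of the database."""
    return [sid for sid, seq in database.items() if contains(seq, pattern)]

D = {
    1: [{"A", "B"}, {"C"}, {"D"}],
    2: [{"A"}, {"E", "C"}, {"F"}],
}
print(retrieve(D, [{"A"}, {"E"}, {"F"}]))  # → [2]
```

This exhaustive scan is the baseline that an index is meant to avoid: it reads every data sequence in full.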



2 Preliminaries 



Data sequences may contain categorical items of various data types. For the sake of 
convenience, we convert the items to integers by means of an item mapping function. 
An item mapping function fi(x), where x is a literal, is a function which transforms a 
literal into an integer value. For example, for a set of literals L = {A, B, C, D, E, F}, an 
item mapping function can take the values: fi(A)=1, fi(B)=2, fi(C)=3, fi(D)=4, fi(E)=5, fi(F)=6. 

Similarly, we use an order mapping function to express data sequence ordering 
relations by means of integer values. Thus, we will be able to represent data sequence 
items as well as data sequence ordering uniformly. An order mapping function fo(x,y), 
where x and y are literals and $fo(x,y) \neq fo(y,x)$, is a function which transforms a data 
sequence <{x}{y}> into an integer value. For the set of literals used in the previous 
example, an order mapping function can be: $fo(x,y) = 6 \cdot fi(x) + fi(y)$, e.g. fo(C,F) = 24. 

Using the above definitions, we will be able to transform data sequences into item 
sets, which are easier to manage, search and index. An item set representing a data 
sequence is called an equivalent set. An equivalent set E for a data sequence $S = \langle X_1 X_2 \ldots X_n \rangle$ 
is defined as: 

$$E = \Big( \bigcup_{x \in X_1 \cup X_2 \cup \ldots \cup X_n} \{fi(x)\} \Big) \cup \Big( \bigcup_{x \in X_i,\ y \in X_j,\ i < j} \{fo(x,y)\} \Big) \qquad (1)$$

where fi() is an item mapping function and fo() is an order mapping function. For 
example, for the data sequence S = <{A,B}{C}{D}> and the presented item mapping 
function and order mapping function, the equivalent set E is evaluated as follows: 

$$E = \{fi(A)\} \cup \{fi(B)\} \cup \{fi(C)\} \cup \{fi(D)\} \cup \{fo(A,C)\} \cup \{fo(B,C)\} \cup \{fo(A,D)\} \cup \{fo(B,D)\} \cup \{fo(C,D)\} = \{1, 2, 3, 4, 9, 15, 10, 16, 22\} \qquad (2)$$

Notice that for any two data sequences $S_1$ and $S_2$, we have: if $S_2$ contains $S_1$, then $E_1 \subseteq E_2$, where 
$E_1$ is the equivalent set for $S_1$ and $E_2$ is the equivalent set for $S_2$. This property is not 
reversible. 
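
The construction of an equivalent set can be sketched in a few lines. The code below uses the example mappings fi(A)=1, ..., fi(F)=6 and fo(x,y) = 6·fi(x) + fi(y); the function names and the list-of-sets representation are assumptions made for the illustration.

```python
from itertools import combinations

ITEMS = "ABCDEF"

def fi(x):
    """Item mapping from the example: A -> 1, ..., F -> 6."""
    return ITEMS.index(x) + 1

def fo(x, y):
    """Order mapping from the example: fo(x, y) = 6 * fi(x) + fi(y)."""
    return 6 * fi(x) + fi(y)

def equivalent_set(seq):
    """Equivalent set of a data sequence given as a list of sets of items."""
    e = {fi(x) for element in seq for x in element}
    # One order code for every pair (x, y) where x occurs in an earlier element than y.
    for earlier, later in combinations(seq, 2):
        e |= {fo(x, y) for x in earlier for y in later}
    return e

E = equivalent_set([{"A", "B"}, {"C"}, {"D"}])
print(sorted(E))  # → [1, 2, 3, 4, 9, 10, 15, 16, 22]
```

The sketch also shows why the subset property cannot be reversed: the equivalent set of <{A,B}> is {1, 2}, a subset of the equivalent set {1, 2, 8} of <{A}{B}>, yet <{A}{B}> does not contain <{A,B}>.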

Since the size of an equivalent set grows quickly with the number of 
sequence elements, we split data sequences into partitions which are 
small enough to process and encode. We say that a data sequence $S = \langle X_1 X_2 \ldots X_n \rangle$ is 
partitioned into data sequences $S_1 = \langle X_1 \ldots X_{a_1} \rangle$, $S_2 = \langle X_{a_1+1} \ldots X_{a_2} \rangle$, ..., $S_k = \langle X_{a_{k-1}+1} \ldots X_n \rangle$ with level $\beta$ 
if for each data sequence $S_i$ the size of its equivalent set $|E_i| < \beta$ and for all $x, y \in X_1 \cup X_2 \cup \ldots \cup X_n$ 
where x precedes y, we have: either <{x}{y}> is contained in some $S_i$, or <{x}> is 
contained in $S_i$ and <{y}> is contained in $S_j$, where i < j ($\beta$ should be greater than the maximal 
item set size). For example, partitioning the data sequence S = <{A,B}{C}{D}{A,F}{B}{E}> 
with level 10 results in two data sequences: $S_1$ = <{A,B}{C}{D}> and $S_2$ = <{A,F}{B}{E}>, 
since the sizes of the equivalent sets are respectively: $|E_1|$ = 9 ($E_1$ = {1,2,3,4,9,15,10,16,22}), 
and $|E_2|$ = 9 ($E_2$ = {1,6,2,5,8,38,11,41,17}). Notice that for a data sequence S partitioned into 
$S_1, S_2, \ldots, S_k$, and a data sequence Q, we have: S contains Q if there exists a partitioning of 
Q into $Q_1, Q_2, \ldots, Q_m$ such that $Q_1$ is contained in $S_{i_1}$, $Q_2$ is contained in $S_{i_2}$, ..., $Q_m$ is 
contained in $S_{i_m}$, and $i_1 < i_2 < \ldots < i_m$. 

Our index structure will consist of equivalent sets stored for all data sequences, 
optionally partitioned to reduce the complexity. To reduce storage requirements, 
equivalent sets will be stored in the database in the form of bitmap signatures. The 
bitmap signature of a set X is a W-bit binary number created, by means of the bit-wise 
OR operation, from the hash keys of all data items contained in X. The hash key of the 
item $x \in X$ is a W-bit binary number defined as follows: $hash\_key(x) = 2^{(x \bmod W)}$. 

The bitmap signature of the set X is the bit-wise OR of all items' hash keys. Notice 
that for any two sets X and Y, if $X \subseteq Y$ then bit_sign(X) AND bit_sign(Y) = 
bit_sign(X), where AND is the bit-wise AND operator. This property is not reversible. 
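
A minimal sketch of the signature computation, assuming the hash key of an integer item x is 2^(x mod W). With W = 16 this assumption reproduces the signature 1000011001011111 that the example in Section 3 lists for the equivalent set {1, 2, 3, 4, 9, 15, 10, 16, 22}. The helper name `may_contain` is illustrative.

```python
W = 16  # signature width in bits

def hash_key(x, w=W):
    """W-bit hash key of an integer item: a single bit at position x mod w."""
    return 1 << (x % w)

def bit_sign(items, w=W):
    """Bitmap signature: bit-wise OR of the hash keys of all items."""
    sig = 0
    for x in items:
        sig |= hash_key(x, w)
    return sig

def may_contain(sig_y, sig_x):
    """Necessary, but not sufficient, condition for X being a subset of Y."""
    return sig_x & sig_y == sig_x

E = {1, 2, 3, 4, 9, 15, 10, 16, 22}
print(format(bit_sign(E), "016b"))  # → 1000011001011111
```

Because several items share a bit position (here 22 and 6 both map to bit 6), `may_contain(bit_sign(E), bit_sign({6}))` is true even though 6 is not in E. This is the ambiguity that makes a final verification step necessary.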


3 Sequential Index Construction Algorithm 



The sequential index construction algorithm iteratively processes all data sequences in 
the database. First, the data sequences are partitioned with the given level $\beta$. Then, for 
each partition of each data sequence, the equivalent set is evaluated. In the next step, 
for each equivalent set, its W-bit bitmap signature is generated and stored in the 
database. The formal description of the algorithm is given below. 

for each data sequence S ∈ D do begin 
  partition S into partitions S_1, S_2, ..., S_k with level β; 
  for each partition S_i do begin 
    evaluate equivalent set E_i for S_i; 
    bitmap_i = bit_sign(E_i); store bitmap_i in the database; 
  end; 
end; 

Example. Assume that $\beta$ = 10, W = 16, and the database D contains three data sequences: $S_1$ 
= <{A,B}{C}{D}{A,F}{B}{E}>, $S_2$ = <{A}{C,E}{F}{B}{E}{A,D}>, $S_3$ = <{B,C,D}{A}>. First, we 
partition the data sequences with $\beta$ = 10. Notice that $S_3$ is, in fact, not partitioned since 
its equivalent set is small enough. The symbol $S_{i,j}$ denotes the j-th partition of the i-th data 
sequence: $S_{1,1}$ = <{A,B}{C}{D}>, $S_{1,2}$ = <{A,F}{B}{E}>, $S_{2,1}$ = <{A}{C,E}{F}>, 
$S_{2,2}$ = <{B}{E}{A,D}>, $S_{3,1}$ = <{B,C,D}{A}>. Then we evaluate the equivalent sets for 
the partitioned data sequences. We use the example item mapping function and order 
mapping function. The symbol $E_{i,j}$ denotes the equivalent set for $S_{i,j}$. 

In the next step, we generate 16-bit bitmap signatures for all equivalent sets. 

bit_sign($E_{1,1}$) = 1000011001011111 
bit_sign($E_{1,2}$) = 0000101101100110 
bit_sign($E_{2,1}$) = 0000101101111010 
bit_sign($E_{2,2}$) = 1010000000110111 
bit_sign($E_{3,1}$) = 0010001000011110 

Finally, the sequential index is stored in the database in the following form: 
{sid=1: bit_sign={1000011001011111, 0000101101100110}, sid=2: bit_sign={0000101101111010, 
1010000000110111}, sid=3: bit_sign={0010001000011110}} 
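
The whole construction can be sketched end to end. The code below is not the paper's implementation: the greedy partitioning rule (extend a partition while its equivalent set stays smaller than β) and the hash key 2^(x mod W) are assumptions, chosen because they reproduce the partitions of this example.

```python
from itertools import combinations

BETA, W = 10, 16  # partitioning level and signature width from the example

def fi(x):
    return "ABCDEF".index(x) + 1

def fo(x, y):
    return 6 * fi(x) + fi(y)

def equivalent_set(seq):
    e = {fi(x) for element in seq for x in element}
    for earlier, later in combinations(seq, 2):
        e |= {fo(x, y) for x in earlier for y in later}
    return e

def partition(seq, beta=BETA):
    """Greedy split (an assumption): extend a partition while |E| stays below beta."""
    parts, current = [], []
    for element in seq:
        if current and len(equivalent_set(current + [element])) >= beta:
            parts.append(current)
            current = []
        current.append(element)
    if current:
        parts.append(current)
    return parts

def bit_sign(items, w=W):
    sig = 0
    for x in items:
        sig |= 1 << (x % w)
    return sig

def build_index(database):
    """Map each sid to the list of bitmap signatures of its partitions."""
    return {sid: [bit_sign(equivalent_set(p)) for p in partition(seq)]
            for sid, seq in database.items()}

D = {
    1: [{"A", "B"}, {"C"}, {"D"}, {"A", "F"}, {"B"}, {"E"}],
    2: [{"A"}, {"C", "E"}, {"F"}, {"B"}, {"E"}, {"A", "D"}],
    3: [{"B", "C", "D"}, {"A"}],
}
index = build_index(D)
for sid, sigs in index.items():
    print(sid, [format(s, "016b") for s in sigs])
```

Under these assumptions the computed signatures agree with the ones listed above for $S_{1,1}$, $S_{1,2}$, $S_{2,2}$ and $S_{3,1}$. For $S_{2,1}$ the sketch additionally sets bit 12, which corresponds to fo(A,F) = 12.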



4 Using Sequential Index for Content-Based Retrieval 

During content-based sequence retrieval, the bitmap signatures for all data sequences 
are scanned. For each data sequence, the test of a searched subsequence mapping is 
performed. If the searched subsequence can be successfully mapped to the data 
sequence partitions, then the data sequence is read from the database. Due to the 
ambiguity of bitmap signature representation, additional verification of the retrieved 
data sequence is required. The verification can be performed using the traditional 
B+ tree method, since it consists in reading the data sequence from the database and 
checking whether it contains the searched subsequence. The formal description of the 
algorithm is given below. We use a simplified notation of Q[i_start..i_end] to denote a 
partition $\langle X_{i\_start} \ldots X_{i\_end} \rangle$ of a sequence $Q = \langle X_1 X_2 \ldots X_n \rangle$, where $1 \le i\_start \le i\_end \le n$. The 
symbol & denotes the bit-wise AND operation. 







for each sequence identifier sid do begin 
  j = 1; i_end = 1; 
  repeat 
    i_start = i_end; 
    evaluate equivalent set E_Q for Q[i_start..i_end]; mask = bit_sign(E_Q); 
    while mask & bit_sign(E_sid,j) <> mask and j <= number of partitions for sid do j++; 
    if j <= number of partitions for sid then repeat 
      i_end++; evaluate equivalent set E_Q for Q[i_start..i_end]; 
      mask = bit_sign(E_Q); 
    until mask & bit_sign(E_sid,j) <> mask or i_end = size of Q; 
  until i_start = i_end or j > number of partitions for sid; 
  if j <= number of partitions for sid then return(sid); 
end; 

Example. Assume that we look for all data sequences which contain the subsequence 
<{F}{B}{D}>. We begin with sid=1. We find that <{F}> matches the first partition. So, 
we check whether <{F}{B}> also matches this partition. Accidentally it does, but when 
we try <{F}{B}{D}>, we find that it does not match the first partition. Then we move 
to the second partition to check whether <{D}> matches the partition. This test fails 
and since we have no more partitions, we reject sid=1 (this data sequence does not 
contain the given subsequence). In the next step, we check sid=2. We find that <{F}> 
matches the first partition. So, we check whether <{F}{B}> also matches this partition. 
It does not, so we move to the second partition and find that <{B}> matches the 
partition. Then we must check whether <{B}{D}> also matches the partition. This time 
the check is positive and since we have matched the whole subsequence, we return 
sid=2 as a part of the result. The data sequence will be verified later. Finally, we 
check sid=3. We find that <{F}> does not match the first partition. Since we have no 
more partitions, we reject sid=3 (this data sequence does not contain the given 
subsequence). So far, the result of our index scanning is the data sequence identified 
by sid=2. We still need to read the sequence and verify whether it really contains the 
searched subsequence. Here it does, and the result is returned to the user. 
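
The walk-through above can be replayed with a short sketch. It is a simplified greedy variant of the scan, not a line-by-line transcription of the pseudocode: in each partition it matches the longest possible run of remaining query elements. The signatures are copied from the example of Section 3; the hash key 2^(x mod W) and all function names are assumptions.

```python
from itertools import combinations

W = 16
INDEX = {  # signatures as listed in the example of Section 3
    1: ["1000011001011111", "0000101101100110"],
    2: ["0000101101111010", "1010000000110111"],
    3: ["0010001000011110"],
}

def fi(x):
    return "ABCDEF".index(x) + 1

def fo(x, y):
    return 6 * fi(x) + fi(y)

def mask_of(segment):
    """Bitmap signature of the equivalent set of a query segment."""
    codes = {fi(x) for element in segment for x in element}
    for earlier, later in combinations(segment, 2):
        codes |= {fo(x, y) for x in earlier for y in later}
    sig = 0
    for c in codes:
        sig |= 1 << (c % W)
    return sig

def fits(segment, sig):
    mask = mask_of(segment)
    return mask & sig == mask

def is_candidate(signatures, query):
    """Greedy scan over the partitions of one data sequence."""
    start = 0  # first query element not matched yet
    for text in signatures:
        sig = int(text, 2)
        end = start
        while end < len(query) and fits(query[start:end + 1], sig):
            end += 1
        start = end
        if start == len(query):
            return True
    return False

query = [{"F"}, {"B"}, {"D"}]
candidates = [sid for sid, sigs in INDEX.items() if is_candidate(sigs, query)]
print(candidates)  # → [2]
```

The returned identifiers are only candidates; each still has to be read and verified against the actual data sequence.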

Fig. 2. Experimental results 

5 Experimental Results 

We performed experiments on Oracle8 (randomly generated data sets with uniform item 
distribution). Figure 2A shows the number of disk blocks read to retrieve data 
sequences containing subsequences of various lengths. The data set contained 50000 
data sequences, each having 20 items on average, drawn from a set of 50 items. The compared database accessing 
methods were: B+ tree index (B+ tree), 24-bit set-based bitmap index (24S), 32-bit 
sequential index ($\beta$ = 28) built on top of 24-bit set-based bitmap index (24S32Q28), and 
48-bit sequential index ($\beta$ = 55) built on top of 24-bit set-based bitmap index 
(24S48Q55). Our index achieved a significant improvement for searched 
subsequences of length greater than 4. 

We also analyzed the influence of the partitioning level on the index performance. 
Figure 2B illustrates the filtering factor for three sequential indexes built on bitmap 
signatures of total size of 48 bits, but with different partitioning. We noticed that 
partitioning data sequences into a large number of partitions results in performance 
increase for long subsequences, but worsens the performance for short subsequences. 
Using a small number of data sequence partitions results in more "stable" 
performance, but worse for long subsequences. 



6 Final Conclusions 

Content-based sequence retrieval is specific in the sense that it requires complicated 
SQL queries and database access methods. In this paper we have introduced a new 
indexing method, called sequential indexing, which can replace B+ tree and 
set-based indexes for dense databases. In our experiments, we found that the 
most efficient solution is to combine a set-based index (which checks the items of a data 
sequence) with a sequential index (which checks the ordering of the items), which 
dramatically outperforms B+ tree access methods. 



References 

1. Agrawal, R., Srikant, R., Mining Sequential Patterns, Proc. 11th ICDE, 1995 

2. Bentley, J.L., Multidimensional binary search trees used for associative searching, Comm. 
of the ACM 18 

3. Comer, D., The Ubiquitous B-tree, Comput. Surv. 11, 1979 

4. Diamantini, C., Panti, M., A Conceptual Indexing Method for Content-Based Retrieval, 
Proc. 15th ICDE, 1999 

5. Guttman, A., R-trees: A dynamic index structure for spatial searching, Proc. ACM 
SIGMOD Conf., 1984 

6. Mannila, H., Toivonen, H., Verkamo, A.I., Discovering frequent episodes in sequences, Proc. 
1st KDD, 1995 

7. Morzy, T., Zakrzewicz, M., Group Bitmap Index: A Structure for Association Rules 
Retrieval, Proc. 4th KDD, 1998 

8. O'Neil, P., Model 204 Architecture and Performance, Springer-Verlag Lecture Notes in 
Computer Science 359, 2nd HPTS, 1987 




The S²-Tree: An Index Structure for 
Subsequence Matching of Spatial Objects 



Haixun Wang and Chang-Shing Perng 

IBM T. J. Watson Research Center 
Yorktown Heights, NY 10598 
{haixun, perng}@us.ibm.com 



Abstract. We present the S²-Tree, an indexing method for subsequence 
matching of spatial objects. The S²-Tree locates subsequences within a 
collection of spatial sequences, i.e., sequences made up of spatial objects, 
such that the subsequences match a given query pattern within a specified 
tolerance. Our method is based on (i) the string-searching techniques 
that locate substrings within a string of symbols drawn from a discrete 
alphabet (e.g., ASCII characters) and (ii) the spatial access methods that 
index (unsequenced) spatial objects. In particular, the S²-Tree can be applied 
to solve problems such as subsequence matching of time-series data, 
where features of subsequences are often extracted and mapped into spatial 
objects. Moreover, it supports queries such as "what is the longest 
common pattern of the two time series?", which previous subsequence 
matching algorithms find difficult to solve efficiently. 



1 Introduction 

The ordering of objects in a sequence can endow them with a significance that an 
unsequenced grouping of the same objects could never convey. In this paper, 
we focus on the design of fast searching methods that will search a database of 
sequences of text, spatial, or multimedia objects to locate those that match a 
query subsequence of objects. Such sequences can be 1-dimensional time series, 
digitized voice or music, video clips, or trails of mobile objects: 

– Time series databases. The efficient matching of time series data often 
relies on some distance-preserving transform, such as the Discrete Fourier 
transform (DFT), which extracts the first f DFT coefficients and maps them 
into points in the f-dimensional feature space. 

– Content-based image querying. A similarity retrieval algorithm for image 
databases extracts image regions and uses the Haar wavelet to compute 
their signatures by mapping them to some multidimensional space. 

– Content-based analysis, indexing, and retrieval of audio or video sequences. 
For instance, the VideoTrails approach to analyzing a video clip involves first 
generating a trail of points in a multidimensional space where each point is 
derived from physical features of a single frame in the video clip. 

– Spatio-temporal databases, which deal with geometries changing over 
time. 

The above databases and the queries processed against them have the 
following characteristics in common: 

– Entities in the databases are either spatial objects, or can be converted into 
spatial objects through feature extraction. Examples of feature vectors are 
color histograms, Fourier vectors, text descriptors, etc. 

– There exists an order among the entities. Objects in time series databases, 
audio, and video sequences are ordered by time. In content-based image 
querying, an image is often decomposed into a set of subimages which can 
be partially ordered by their relative positions in the original image. In this 
paper, we assume there is a total order among the entities. 

Taking advantage of the above characteristics, we find that the task of sub- 
sequence matching against sequences of text, spatial, or multimedia objects is 
essentially the following problem: Given a query spatial sequence, search a set of 
spatial sequences to locate subsequences that are similar to the query sequence. 

However, traditional database indexing techniques are inadequate for this 
purpose. There is currently much excellent work in indexing multidimensional 
data, including geometric hashing, grid-based index structures, and the 
R-tree family of index structures. These spatial access methods are designed 
to index unsequenced objects. The order among the entities is not taken into con- 
sideration when the index structures are created and hence no effective retrieval 
method in terms of subsequence matching of spatial objects is supported. 

Our work extends the substring matching technique to spatial sequences. 
We propose a new index structure, the S²-Tree, that can be applied to search 
databases of different contents when the features of the data are extracted into 
sequences of spatial objects. It also supports new SQL predicates, for example, 
sound like and look like, which are similar to the standard like predicate for 
substring matching, for queries in multimedia databases. In this paper, we focus 
on time series, where temporal patterns are usually mapped to feature vectors 
in some high-dimensional space. 

The organization of the rest of the paper is as follows. We first review some 
background material, including spatial access methods and the suffix tree. In 
Section 3 we propose the S²-Tree for fast subsequence matching of spatial objects. 
In Section 4 we use the S²-Tree for subsequence matching in time series databases. 
Section 5 contains experiments that show the effectiveness of our algorithms. 

2 Background 

The R-tree [7] can be viewed as an extension of the B-tree to multi-dimensions. 
The R*-tree improves the R-tree by introducing a policy called forced 
reinsert. It also refines the node splitting policy by taking overlapping area and 
region perimeter into consideration. Another modification of the R-tree, called 
the X-tree, is well suited for indexing high-dimensional data. The main idea of 
the X-tree is to avoid overlap of bounding boxes in the directory by using a 
new organization of the directory which is optimized for high dimensional space. 
Instead of allowing splits that introduce high overlaps, X-tree postpones node 
splitting by introducing supernodes, i.e., nodes larger than the usual block size. 

A suffix tree embodies a compact index to all the distinct, non-empty 
substrings of a given string. The overall space requirement of the suffix tree is 
linear in the length of the string it represents. Various approaches to building 
the substring index in linear time have been developed. McCreight's algorithm 
builds a suffix tree in linear time and is space efficient. Ukkonen developed 
a linear-time, on-line suffix tree construction algorithm. 



3 The S²-Tree 

The S²-Tree¹ is motivated by the fact that searching substrings in a suffix tree 
takes on average $O(a + \log n)$ disk accesses, where a is the size of the answer set. 
It would be desirable if the same technique could be used to solve the problem of 
subsequence matching of spatial objects and reduce its complexity. 

The major differences between spatial sequences and text strings are: (i) The 
alphabet of text strings usually consists of only a few discrete symbols (e.g. 
the ASCII set), whereas spatial sequences do not have a pre-defined "alphabet"; (ii) There is 
no relationship among symbols in a text string, whereas relationships can exist 
between two spatial objects, for example, contains and overlaps. 

The S²-Tree bridges the gap between spatial sequences and text strings by 
creating an alphabet that encodes spatial objects as well as the (containment) 
relationship. The S²-Tree is a combination of two trees: (i) The X-tree, which 
provides a clustering method, according to which objects are converted into 
binary encodings that embody the containment relationship; (ii) The suffix tree, 
which implements subsequence matching on the binary sequences. 

For the rest of the paper, we shall use the following notational conventions. 
Unless otherwise specified, we use uppercase letters, Q, R, S, to denote spatial 
sequences and we use lowercase letters, a, b, c, to denote minimum bounding 
rectangles (MBRs) of spatial objects. 

$|S|$               the length of spatial sequence S. 
$S[i]$              the i-th entry of spatial sequence S. 
$S[i,j]$            a subsequence that includes entries in position i through j. 
$a \subseteq b$     MBR b contains MBR a. 
$a'$                a binary encoding of MBR a. 
$S'$                a binary encoding of spatial sequence S. 

Given two spatial sequences P and Q, we say P matches Q if P and Q are of 
the same length and each spatial object of P is contained by the corresponding 
spatial object of Q, i.e., $P[i] \subseteq Q[i]$, for all $1 \le i \le |P|$. Now, the problem of 
subsequence matching of spatial objects can be defined as follows: 

– We have N spatial sequences $S_1, \ldots, S_N$, each of potentially different length. 

– We have a query subsequence Q of length $|Q|$. 

– We want to find all the sequences $S_i$, along with the correct offset k, such 
that the subsequence $I = S_i[k, k + |Q| - 1]$ is enveloped by the query sequence, 
i.e., each spatial object in I is inside its corresponding spatial object in Q. 

¹ S²-Tree stands for Spatial Suffix Tree. 
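
Before any index is involved, the problem statement above can be written down as a brute-force scan, which also serves as a correctness baseline. In this sketch an MBR is a pair of corner tuples and a point is a degenerate MBR; these representations and all function names are assumptions made for the illustration.

```python
from typing import List, Sequence, Tuple

MBR = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)

def inside(a: MBR, b: MBR) -> bool:
    """True if MBR a is contained in MBR b in every dimension."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(bl <= al and ah <= bh
               for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi))

def matches(p: Sequence[MBR], q: Sequence[MBR]) -> bool:
    """P matches Q: equal length and P[i] contained in Q[i] for every i."""
    return len(p) == len(q) and all(inside(x, y) for x, y in zip(p, q))

def scan(sequences: List[List[MBR]], query: List[MBR]) -> List[Tuple[int, int]]:
    """All pairs (i, k), 1-based as in the text, with S_i[k, k+|Q|-1] matching Q."""
    hits = []
    for i, s in enumerate(sequences, start=1):
        for k in range(1, len(s) - len(query) + 2):
            if matches(s[k - 1:k - 1 + len(query)], query):
                hits.append((i, k))
    return hits

def point(*coords: float) -> MBR:
    return (coords, coords)  # a point is an MBR with identical corners

S1 = [point(1, 1), point(2, 5), point(3, 2), point(9, 9)]
S2 = [point(2, 4), point(3, 3)]
Q = [((0, 4), (3, 6)), ((2, 0), (4, 4))]  # two query boxes; their sizes set the tolerance
print(scan([S1, S2], Q))  # → [(1, 2), (2, 1)]
```

The scan tests every offset of every sequence against the query, which is the work an index structure is meant to avoid.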

When a query subsequence Q is given, a matching tolerance is specified im- 
plicitly at the same time. A bigger MBR represents a higher tolerance. The user 
has the freedom to enlarge/reduce the size of each MBR in the query subse- 
quence, i.e., the tolerance can be customized for different portions of the query 
subsequence. For example, to specify “don’t care”, the user can simply make 
some MBRs in the query subsequence as large as the universe so that each 
spatial object in the database matches the MBRs at those positions. 

3.1 Three Steps to Constructing the S²-Tree 

Creating an index structure. Given a set of spatial sequences, we add all the 
spatial objects in those sequences into a multidimensional index structure. The 
following are the major concerns when we choose our spatial access method: 

– Dimensionality. One goal of the S²-Tree is to index features extracted from 
databases of different contents. These features can have 3-20 dimensions. 
The index structure should be able to handle dimensions in this range. 

– Overlap. Overlap is the percentage of the volume covered by more than one 
directory MBR. The S²-Tree maps an object into binary strings according 
to the hierarchy of the index structure. Minimizing the overlap is equivalent 
to minimizing the number of different mappings for each spatial object. 

– Space Utilization. The size of the index structure is another concern. Since 
the length of the encoded binary string depends on both the width and 
depth of the tree structure, maximizing storage utilization is equivalent to 
minimizing the total number and length of the binary strings. 

After comparing the i?-tree family access methods, we chose the X-tree for our 
index structure. The notion of supernode introduced by the X-tree creates a bal- 
ance between overlap and space utilization and is more suitable for our purpose. 



Encoding the X-tree. The root node of the X-tree is labeled with $\varepsilon$, the empty
string. An edge connecting a node with its k-th child is labeled k (in binary,
$k \ge 0$). Nodes other than the root are labeled with the concatenation of the
labels on the edges connecting the root to that node. Thus, we have generated
an alphabet A = {labels on all the nodes}. The following property holds for the
prefix relationship among the symbols in the alphabet:

Theorem 1. If $\alpha, \beta \in A$ and $\alpha$ is a prefix of $\beta$, then the MBR of the node
labeled with $\alpha$ contains the MBR of the node labeled with $\beta$.

Proof: Since $\alpha$ is a prefix of $\beta$, according to the encoding method, the node labeled with
$\alpha$ must be an ancestor of the node labeled with $\beta$. The property holds because in
the X-tree, the MBR of a child node is contained by the MBR of its parent node.

For instance, we have a spatial sequence S = abcdefghijklmnop. Each symbol
in the sequence is a point in a 2-dimensional space. Figure 1(a) shows these
points and Figure 1(b) is the corresponding X-tree built on these points (minimal
and maximal number of entries per node are 2 and 3, max overlap 20%).




(a) Some 2-dimensional points organized into an X-tree. (b) A labeled X-tree. The leaves contain pointers to the spatial objects.



Fig. 1. Using X-tree to cluster spatial objects. 



Since the maximum branching factor of the X-tree in Figure 1(b) is 3, we need
no more than two bits to label an edge. Each node is coded as the concatenation
of the labels on the path from the root to that node. Thus the leftmost leaf node,
to which points a and b belong, is 0000, and the rightmost leaf node is 1010. The
alphabet A, composed of all the codes of the nodes, is as follows:

$A = \{\varepsilon, 00, 01, 10, 0000, 0001, 0100, 0101, 0110, 1000, 1001, 1010\}$
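
The labeling rule can be sketched in a few lines. The nested-list tree shape is an assumption read off the alphabet listed here (a root with three children holding 2, 3 and 3 leaves); `label_nodes` and `is_ancestor_label` are hypothetical names, and the empty Python string stands for the empty label of the root.

```python
def label_nodes(tree, bits):
    """Label every node of a nested-list tree; a leaf is an empty list.

    The k-th child (k = 0, 1, ...) of a node extends the parent's label with
    k written in `bits` binary digits. Labels are returned level by level.
    """
    labels, frontier = [], [("", tree)]
    while frontier:
        next_frontier = []
        for label, node in frontier:
            labels.append(label)
            for k, child in enumerate(node):
                next_frontier.append((label + format(k, f"0{bits}b"), child))
        frontier = next_frontier
    return labels

def is_ancestor_label(a, b):
    """Prefix test of Theorem 1: node a is node b or one of its ancestors."""
    return b.startswith(a)

# Assumed shape of the example tree: three internal nodes with 2, 3, 3 leaves.
tree = [[[], []], [[], [], []], [[], [], []]]
alphabet = label_nodes(tree, bits=2)
print(alphabet)
```

Because containment of MBRs follows the ancestor relation, the prefix test on labels is enough to decide containment between two nodes without touching their coordinates.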



Creating the suffix tree. Representing each spatial object in the original
sequence with the code of the node it belongs to, we transform S into
S' using alphabet A (the $ sign at the end of S' marks the end of the sequence):

S' = 0000 · 0000 · 0100 · 0110 · 1000 · 0001 · 0001 · 0100 · 0110 · 1000 · 1010 · 1010 · 1001
· 1001 · 0101 · 0101 · $

or, if we write the binary code in decimal numbers:

S'' = 0 · 0 · 4 · 6 · 8 · 1 · 1 · 4 · 6 · 8 · 10 · 10 · 9 · 9 · 5 · 5 · $

We construct a suffix tree in linear time for sequence S' using McCreight’s
algorithm. A partial suffix tree is shown in Figure 2. The pairs on the edges
are the indices in sequence S'. For instance, (3, 5) represents subsequence S''[3, 5],
i.e., subsequence 4 · 6 · 8 in decimal, or 0100 · 0110 · 1000 in binary. The leaves
are labeled with the start positions in S' of the suffixes that they represent. For
example, subsequence S''[3, 5] = 4 · 6 · 8 can be found at offsets 3 and 8 of S''.

To construct a suffix tree for a set of spatial sequences, we glue them together
into a long sequence by a special symbol and construct the suffix tree for the
concatenated sequence.
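
The running example can be reproduced with a short sketch. Note the substitution: instead of McCreight's linear-time suffix tree, the code builds a naive, uncompressed suffix trie in quadratic time, which is enough to check the offsets quoted in the text. The names `build_suffix_trie` and `occurrences` are hypothetical, and the end marker `$` is left out for brevity.

```python
# The encoded sequence S' from the text, one binary code per spatial object.
S_PRIME = ("0000 0000 0100 0110 1000 0001 0001 0100 "
           "0110 1000 1010 1010 1001 1001 0101 0101").split()

# The same sequence with every code written as a decimal number.
S_DEC = [int(code, 2) for code in S_PRIME]

def build_suffix_trie(seq):
    """Insert every suffix symbol by symbol (O(n^2) time and space).

    Each node records the 1-based start positions of the suffixes that
    pass through it, which plays the role of the leaf labels in the text.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(seq)):
        node = root
        for sym in seq[start:]:
            node = node["children"].setdefault(sym, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root

def occurrences(trie, pattern):
    """1-based offsets at which `pattern` occurs in the indexed sequence."""
    node = trie
    for sym in pattern:
        node = node["children"].get(sym)
        if node is None:
            return []
    return node["starts"]

trie = build_suffix_trie(S_DEC)
print(S_DEC)
print(occurrences(trie, [4, 6, 8]))
```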




Fig. 2. The suffix tree built for sequence S' . Each edge is succinctly represented by a 
pair of indices in the data structure of the suffix tree. 

3.2 Encoding the Query Sequence 

Given a query sequence Q, we need to encode it into binary strings over alphabet 
A before we can do the search. A spatial object, or its MBR, may correspond to 
several symbols in the alphabet. We use procedure EncodeSpatialObject(mbr, 
root, set), where root is the root of the X-tree, to encode the mbr into a set 
of symbols. Figure 3 shows the algorithm. It performs a depth-first search of the
X-tree, looking for the uppermost internal nodes contained entirely inside mbr,
or leaf nodes that intersect mbr. For instance, a spatial object which contains
the MBR of the root node will be encoded into a single symbol, $\varepsilon$.



Procedure EncodeSpatialObject(mbr, Node, symbol_set)
Input: mbr is the MBR of a spatial object. Node is a node of the X-tree
Input & Output: symbol_set is a set of binary characters in alphabet A

if mbr contains the MBR of Node or Node is a leaf node then
    add the label of Node to symbol_set
else
    for each child node c of Node do
        if the MBR of node c intersects mbr then
            EncodeSpatialObject(mbr, c, symbol_set)
        end if
    end for
end if



Fig. 3. EncodeSpatialObject() encodes a spatial object into a set of symbols. 
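
A direct Python transcription of the procedure may help. It is a sketch under assumptions: the `Node` class, the 2-dimensional box layout `(xmin, ymin, xmax, ymax)` and the small example tree are invented for illustration and are not the tree of Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass
class Node:
    label: str                      # binary label; "" is the root
    mbr: Box
    children: List["Node"] = field(default_factory=list)

def contains(a: Box, b: Box) -> bool:
    """True if box a contains box b."""
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]

def intersects(a: Box, b: Box) -> bool:
    """True if boxes a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def encode_spatial_object(mbr: Box, node: Node, symbols: Set[str]) -> None:
    """Collect the labels of the uppermost nodes covered by mbr, or of
    the leaves that mbr intersects, following the procedure in Fig. 3."""
    if contains(mbr, node.mbr) or not node.children:
        symbols.add(node.label)
    else:
        for child in node.children:
            if intersects(child.mbr, mbr):
                encode_spatial_object(mbr, child, symbols)

# Hypothetical two-level tree used only for this example.
leaves_00 = [Node("0000", (0, 0, 4, 4)), Node("0001", (0, 6, 4, 10))]
leaves_01 = [Node("0100", (6, 0, 10, 4)), Node("0101", (6, 6, 10, 10))]
root = Node("", (0, 0, 10, 10), [Node("00", (0, 0, 4, 10), leaves_00),
                                  Node("01", (6, 0, 10, 10), leaves_01)])

def encode(mbr: Box) -> Set[str]:
    symbols: Set[str] = set()
    encode_spatial_object(mbr, root, symbols)
    return symbols

print(encode((-1, -1, 5, 11)))
```

In this example a query box that falls into the gap between the two subtrees, such as `(4.5, 0, 5.5, 10)`, yields an empty symbol set, which is the "encoding fails" case.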

EncodeSpatialObject() maps each spatial object in the query sequence Q
into a set of symbols, and we get a list of symbol sets, L, for the entire sequence
Q. Encoding fails if any member of list L is an empty set. The user will then have
to raise the tolerance, i.e., to enlarge the MBR at the corresponding position in
the query sequence, in order to find any match in the database.

3.3 Subsequence Matching in the S²-Tree

The search algorithm (Figure 5) works as follows: Given L (the list of symbol sets
corresponding to a query sequence), the algorithm performs a depth-first search
of the suffix tree. At a certain node, if all the symbol sets have been matched
(N = 0), then the labels of the leaf nodes that are descendants of that node will
be the offsets of the query sequence Q found in S'. Otherwise, it calls Search()
recursively on the sub-nodes whose labels match a prefix of L (line 07).



Procedure symbolMatch(seq, L)
Input: seq is a symbol sequence; L is a sequence of sets, and |L| = |seq|
Output: SUCCESS if seq matches L, FAIL otherwise

for i = 1 to |seq| do
    if none of the symbols in set L[i] is a prefix of the symbol seq[i] then
        return FAIL
    end if
end for
return SUCCESS



Fig. 4. symbolMatch() decides whether a label can be matched by a symbol-set. 
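
The prefix test of symbolMatch() is small enough to state in Python. This sketch assumes that symbols are binary label strings and that each query position is a Python set of such strings; the name `symbol_match` is hypothetical.

```python
def symbol_match(seq, symbol_sets):
    """Return True if, at every position, some symbol of the query set
    is a prefix of the data symbol (i.e. its node is an ancestor or equal)."""
    if len(seq) != len(symbol_sets):
        return False
    return all(any(sym.startswith(q) for q in qs)
               for sym, qs in zip(seq, symbol_sets))

data = ["0100", "0110", "1000"]
print(symbol_match(data, [{"01"}, {"0110", "0000"}, {""}]))
print(symbol_match(data, [{"00"}, {"0110"}, {""}]))
```

The empty label `""` is a prefix of every symbol, so a set containing it behaves like a "don't care" position.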



In contrast to substring matching in traditional suffix trees, our searching
algorithm will traverse multiple sub-branches of a node when more than one
subsequence on the edges matches a prefix of L. We use symbolMatch() in
Figure 4 to determine the compatibility between a list of symbols and a list of
symbol sets generated by EncodeSpatialObject(). Instead of exact matching,
the partial order in the alphabet is used to decide whether the MBR of a symbol
is contained in another MBR, thus allowing matching with a flexible tolerance.

It is easy to prove that the above searching algorithm is correct, that is, it
never misses qualifying subsequences. This is simply because neither the X-tree
nor the suffix tree allows false dismissals. However, the result returned in the
offset_set by the Search() procedure may contain some “false alarms”, and
they are discarded in the post-processing step (Section 3.4).
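
The traversal can be sketched as follows. To keep the example short it swaps the compressed suffix tree (whose edges carry index pairs) for a naive uncompressed suffix trie in which every edge carries a single symbol, so the prefix test is applied one symbol at a time. The function names and the 0-based offsets are assumptions of the sketch.

```python
def build_suffix_trie(seq):
    """Naive O(n^2) suffix trie; every node stores the 0-based start
    positions of the suffixes passing through it."""
    root = {"children": {}, "starts": list(range(len(seq)))}
    for start in range(len(seq)):
        node = root
        for sym in seq[start:]:
            node = node["children"].setdefault(sym, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root

def search(node, symbol_sets, offsets):
    """Depth-first search that follows every branch whose symbol has a
    prefix in the current query set, as Search() does with symbolMatch()."""
    if not symbol_sets:                      # all symbol sets matched
        offsets.update(node["starts"])
        return
    for sym, child in node["children"].items():
        if any(sym.startswith(q) for q in symbol_sets[0]):
            search(child, symbol_sets[1:], offsets)

S_PRIME = ("0000 0000 0100 0110 1000 0001 0001 0100 "
           "0110 1000 1010 1010 1001 1001 0101 0101").split()
trie = build_suffix_trie(S_PRIME)

hits = set()
search(trie, [{"01"}, {"01"}, {"10"}], hits)
print(sorted(hits))
```

Unlike exact substring search, several children may qualify at each step, which is why the loop does not stop at the first match. The offsets returned this way are candidates only; the containment check on the original objects still has to remove false alarms.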

3.4 Minimizing False Alarms 

The Search() procedure returns a superset of the qualifying subsequences. To
filter out false alarms, we check each offset returned by Search() to see if we
have a valid match.

This post-processing step is time-consuming when we have a high percentage
of false alarms. False alarms are introduced during the encoding of the query
sequence. Suppose $\alpha$ is one of the symbols which encode a spatial object s in
the query sequence; then $\alpha$ corresponds to a node N in the X-tree. There are two cases:

– N is a leaf node. The MBR of N contains the MBR of s. Suppose N consists
of k spatial objects; we may introduce as many as k false alarms by encoding
s into $\alpha$, since it is possible that none of them is actually contained by s.

– N is not a leaf node. No false alarm is introduced by this encoding. The
MBR of the spatial object contains the MBR of node N, which means it
contains all the MBRs of the spatial objects in the leaf nodes under node N.

Procedure Search(Node, L, offset_set)
Input: Node is a node in the suffix tree; L is a sequence of symbol sets
Output: offset_set is the set of all the start positions in S' of the matching subsequences

01 N ← |L|
02 if N = 0 then
03     add all the labels on the leaves that are descendants of Node to offset_set
04 end if
05 for each child node c of Node do
06     (i, j) ← the pair index on the edge linking Node and its child c
07     if symbolMatch(S''[i, j], L[1, j - i + 1]) = SUCCESS then
08         Search(c, L[j - i + 2, N], offset_set)
09     end if
10 end for

Fig. 5. Search() performs subsequence matching on a suffix tree.

Thus, the number of false alarms is affected by the number of the objects 
contained in the leaf nodes of the X-tree. However, reducing the size of leaf nodes 
will bring the following disadvantages to subsequence matching in the suffix tree: 

– Smaller block size means we need more nodes to hold all the spatial objects.
This translates to a larger alphabet, and a larger suffix tree.

– Usually, the size of the MBR of the leaf node will decrease if it holds fewer
objects. Thus, a spatial object will possibly be encoded into more symbols,
and the symbolMatch() procedure will find more matches, which means
we need to traverse more sub-branches in the suffix tree.

Hence, we need a balance between the number of entries in the leaf nodes
and the potential size of the symbol set. To reduce the size of the symbol set,
we prune the X-tree bottom-up. The pruning process picks, on level h - 1, the node
that has the smallest MBR among all the nodes on level h - 1, where h is the
height of the X-tree, and removes all its child nodes. Now this node becomes a
new leaf node, which represents all the entries in its former child nodes. Thus,
we reduce the size of the symbol set by increasing the number of the entries in
the leaf node. We repeat this process until the size of the alphabet falls below
a threshold value. We will study the relationship between the number of entries
in the leaf node and the number of false alarms further in Section 5.1.
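
The bottom-up pruning loop can be sketched as below. The sketch assumes that the tree is given as a dictionary from binary labels to 2-dimensional MBRs, that every edge label has `BITS` bits (so the parent of a node is its label minus the last `BITS` characters), and that "size of an MBR" means its area; the coordinates are invented for illustration.

```python
BITS = 2  # bits per edge label, as in the running example

def area(mbr):
    x0, y0, x1, y1 = mbr
    return (x1 - x0) * (y1 - y0)

def prune(nodes, threshold):
    """Repeatedly turn the smallest parent of the deepest nodes into a leaf
    until the alphabet (number of labels) falls below `threshold`."""
    nodes = dict(nodes)
    while len(nodes) >= threshold:
        depth = max(len(label) for label in nodes)
        if depth == 0:
            break  # only the root is left
        # Nodes one level above the deepest level that still have children.
        parents = sorted({l[:-BITS] for l in nodes if len(l) == depth})
        victim = min(parents, key=lambda l: area(nodes[l]))
        for label in [l for l in nodes if len(l) == depth and l[:-BITS] == victim]:
            del nodes[label]
    return nodes

# Hypothetical MBRs for the twelve labels of the example alphabet.
nodes = {
    "": (0, 0, 10, 10),
    "00": (0, 0, 3, 3), "01": (4, 0, 10, 4), "10": (0, 5, 10, 10),
    "0000": (0, 0, 1, 1), "0001": (2, 2, 3, 3),
    "0100": (4, 0, 5, 1), "0101": (6, 1, 7, 2), "0110": (9, 3, 10, 4),
    "1000": (0, 5, 1, 6), "1001": (4, 7, 5, 8), "1010": (9, 9, 10, 10),
}
pruned = prune(nodes, threshold=11)
print(sorted(pruned))
```

After pruning, the objects that used to be encoded with a removed leaf label would be re-encoded with the label of the new leaf, which is what enlarges the leaf and shrinks the alphabet.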



3.5 Repeated Subsequences 

The detection of repeated patterns in sequences is an important activity which
crops up in a variety of different situations. However, most similarity-based time
series matching algorithms find it difficult to answer queries such as “what is
the longest common pattern of the two time series?”

320 H. Wang and C.-S. Perng

It is much easier to use the S²-Tree for this purpose. To find the longest common
subsequences, it suffices to keep track of the maximum length of sequences
represented by the internal nodes during the construction of the suffix tree. To
find the most frequently repeated subsequences of length k, it suffices to perform
a walk of the suffix tree and compare the number of descendants of all the nodes
which represent strings of length k.
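
Both queries can be mimicked on the running example without a suffix tree. The sketch below simply counts windows, which gives the same answers as counting descendants of the nodes at depth k but without the efficiency of the tree walk; `most_frequent` and `longest_repeated` are hypothetical names.

```python
from collections import Counter

def most_frequent(seq, k):
    """Most frequently repeated subsequences of length k and their count."""
    counts = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best), best

def longest_repeated(seq):
    """Longest subsequences that occur at least twice (empty list if none)."""
    for k in range(len(seq) - 1, 0, -1):
        patterns, count = most_frequent(seq, k)
        if count > 1:
            return patterns
    return []

# The example sequence in decimal notation, without the end marker.
S_DEC = [0, 0, 4, 6, 8, 1, 1, 4, 6, 8, 10, 10, 9, 9, 5, 5]
print(most_frequent(S_DEC, 3))
print(longest_repeated(S_DEC))
```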

4 Subsequence Matching in Time-Series Databases 

Time series databases naturally arise in business as well as scientific decision- 
support applications. Most current time series subsequence matching algorithms 
can be seen as consisting of two phases: 

1. Converting time series to points in feature space using DFT or other feature
extraction methods.

2. Using spatial access methods (e.g., R*-tree) to store and retrieve the features.

In the ST-index method, DFT is used to map a time series to the frequency domain. It uses
a moving window of size w, and features of the subsequence inside the moving
window are extracted. Thus, a data sequence S is mapped to a trail consisting
of $|S| - w + 1$ points in feature space. The trail is then divided into sub-trails,
and the MBRs of the sub-trails are managed by the R*-tree.
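
A moving-window feature extraction of this kind can be sketched with the standard library alone. The $1/\sqrt{w}$ normalization and the choice of keeping the first three coefficients as complex numbers are assumptions of the sketch; `dft_features` and `trail` are hypothetical names.

```python
import cmath

def dft_features(window, n_coeffs):
    """First n_coeffs DFT coefficients of one window (1/sqrt(w) scaling)."""
    w = len(window)
    return [sum(x * cmath.exp(-2j * cmath.pi * f * t / w)
                for t, x in enumerate(window)) / w ** 0.5
            for f in range(n_coeffs)]

def trail(series, w, n_coeffs=3):
    """Slide a window of size w over the series; one feature point per offset."""
    return [dft_features(series[i:i + w], n_coeffs)
            for i in range(len(series) - w + 1)]

points = trail([2.0] * 10, w=4)
print(len(points))          # |S| - w + 1 feature points
print(abs(points[0][0]))    # magnitude of the zero-frequency coefficient
```

For a constant series only the zero-frequency coefficient is non-zero, which makes the example easy to check by hand.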

We noticed several limitations of this pioneering work: 

1. The effectiveness of the ST-index is affected by the length of the query
pattern. Since the ST-index ‘knows’ only about subsequences of length w,
query patterns longer than w will be broken into sub-queries of
length w. It then searches for subsequences that match at least one of the
sub-queries. This approach will clearly enlarge the search space.

2. The ST-index uses a fixed tolerance, $\varepsilon$, for the entire query pattern. Users
might want to have a different $\varepsilon$ (e.g., “don’t care”) for different parts.

3. It is very difficult for the ST-index to answer queries such as “what are the most
frequently repeated sequences of length k?” or “what is the longest common pattern?”.

4. The problems of amplitude scaling and offset translation are not addressed.

To overcome these limitations, we use the S²-Tree in phase 2 for subsequence
matching of time series data. The S²-Tree naturally overcomes the first three
limitations mentioned above.

However, in order to solve the problem of amplitude scaling and offset translation,
we need an improved feature extraction method in phase 1. In [11], we
proposed a new feature extraction algorithm called Landmarks, which extracts
features that are invariant under certain transformations.

5 Experimental Evaluations 

We implemented the S²-Tree and ran our experiments on stock price spreadsheets
from Yahoo!. Our environment is a SPARC 20 machine running Solaris
2.7 with 128 MB of memory.

5.1 False Alarms and the Size of the Alphabet

Figure 6(a) shows that the percentage of false alarms drops dramatically when the size of the
alphabet increases. (The database contains 10,000 spatial objects.) However,
when the size of the alphabet continues to increase, the percentage of false alarms
rebounds. The reason for this phenomenon is explained in Section 3.4.







Fig. 6. (a) Relationship between the size of the encoding alphabet and the percentage 
of false alarms, (b) Percentage of false alarms varying the length of the query sequence. 



5.2 Performance Comparison 

The use of the S²-Tree index structure in phase 2 is independent of the feature
extraction methods applied on the time series data in phase 1. The ST-index
uses the first 3 DFT coefficients to map stock prices into the feature space.

Figure 6(b) shows the impact of the length of the query sequence on the
number of false alarms. These experiments were carried out on points extracted
by the DFT with window size w = 64. For each point in the query sequence,
the S²-Tree used a universal tolerance $\varepsilon$, which was also used by the ST-index
method in comparison. Since the ST-index method indexes only patterns of length
w (the size of the moving window), false alarms increase when the
query sequence becomes longer. (For query sequences longer than w, the ST-index
uses the ‘MultiPiece’ algorithm, which searches a space volume considerably
smaller than the ‘PrefixSearch’ algorithm, but the percentage of false alarms still
increases.) When searching for a longer query sequence using the S²-Tree, however, the
percentage of false alarms decreases. This is because when we go deeper in the
suffix tree, an internal node will have a smaller number of leaf nodes under it.

The S²-Tree outperforms the ST-index for subsequence matching of time-series
data. Figure 7(a) shows the relative response time of the ST-index method ($T_s$)
vs. the S²-Tree method ($T_{ss}$), both using DFT to extract features from the
stock prices. The advantage of the S²-Tree is also demonstrated by using the
Landmarks method [11] for feature extraction in Figure 7(b), where $T_s$ is the
response time of the ST-index method, and $T_{ss}$ that of the S²-Tree method using
the Landmarks for feature extraction. The advantage of the S²-Tree method is
obvious when the query sequence is mapped to more than one spatial object.

Fig. 7. ST-index vs. S²-Tree, using different methods in phase 1.



5.3 Longest Common Subsequences of Two or More Sequences 

We use the S²-Tree to search for longest common subsequences of two or more
sequences. Figure 8 shows two stock price curves that have the “double bottoms”
characteristic. The Landmarks features of the curves are extracted into a
2-dimensional space: pv is the percentage of change between the previous landmark
and the current one; vr is the ratio of changes between the previous period
and the current one (for details, see [11]). The S²-Tree successfully retrieved the
“double bottoms” as the longest common subsequences of the two curves.

6 Conclusions

In this paper, we have developed an index method, the S²-Tree, for subsequence
matching of spatial objects. The insight is the observation that spatial sequences,
as well as any other sequences that can be mapped into spatial sequences through
feature extraction, are very similar to text strings when it comes to subsequence
matching. Thus, we adapt the substring matching techniques, particularly the
suffix tree index structure, to subsequence matching of spatial objects. We solved
the problem of clustering and encoding spatial objects and, most important of all,
a partial order that denotes the containment relationship among spatial objects is
retained in the encoding. Experiments on indexing time-series data show that by
minimizing false alarms, our algorithm outperforms previous approaches. Also,
the S²-Tree is capable of locating repeated subsequences and answering queries
such as “What is the longest common pattern in the two time-series?”.

Abstract. This study proposes a data mining framework to discover qualitative
and quantitative patterns in discrete-valued time series (DTS). In our method,
there are three levels for mining similarity and periodicity patterns. At the first 
level, a structural-based search based on distance measure models is employed 
to find pattern structures; the second level performs a value-based search on the 
discovered patterns using local polynomial analysis; and then the third level based 
on hidden Markov-local polynomial models (HMLPMs), finds global patterns 
from a DTS set. We demonstrate our method on the analysis of “Exchange Rates 
Patterns” between the U.S. dollar and the United Kingdom Pound. 

Keywords: temporal data mining, discrete-valued time series, similarity patterns, 
periodicity analysis, local polynomial modelling, hidden Markov models. 



1 Introduction 

Temporal data mining is concerned with discovering qualitative and quantitative tem- 
poral patterns in a temporal database or in a discrete-valued time series (DTS) dataset. 
DTS commonly occur in temporal databases (e.g., the weekly salary of an employee, or 
a daily rainfall at a particular location). We identify two kinds of major problems that 
have been studied in temporal data mining: 

1. The similarity problem: finding fully or partially similar patterns in a DTS, and

2. The periodicity problem: finding fully or partially periodic patterns in a DTS.

Although there are various results to date on discovering periodic patterns and
similarity patterns in DTS datasets, a general theory and general method of data
analysis for discovering patterns in DTS data is not well known.

Our proposed framework is based on a new model for discovering patterns by using
hidden Markov models and local polynomial modelling. The first step of the framework
consists of a distance measure function for discovering structural patterns (shapes). In this
step, only the rough shapes of patterns are determined from the DTS, and a distance measure
is employed to compute the nearest neighbors (NN) to, or the closest candidates of, given
patterns among the similar ones selected. In the second step, the degree of similarity
and periodicity between the extracted patterns is measured based on local polynomial
models. The third step of the framework consists of a hidden Markov-local polynomial
model for discovering patterns at all levels based on results from the first two steps.

The paper is organised as follows. Section 2 presents the definitions and basic methods
of hidden Markov models and local polynomial modelling. Section 3 presents our
new method of hidden Markov-local polynomial models (HMLPM). Section 4 applies
the new models to “Daily Foreign Exchange Rates” data and Section 5 discusses related
work. The final section concludes the paper with a short summary.

2 Definitions and Basic Methods 

We first give a definition of what we mean by DTS and some other notations will be 
introduced later. The basic models will be given here and studied in detail in the rest of 
the paper. 

Definition 1 Suppose that $\{\Omega, \Gamma, \Sigma\}$ is a probability space and T is a discrete-valued
time index set. If for any $t \in T$ there exists a random variable $\xi_t(\omega)$ defined on $\{\Omega, \Gamma, \Sigma\}$,
then the family of random variables $\{\xi_t(\omega), t \in T\}$ is called a discrete-valued time series
(DTS).

2.1 Definitions and Properties 

We consider the bivariate data $(X_1, Y_1), \ldots, (X_n, Y_n)$ which form an independent and
identically distributed sample from a population (X, Y). For given pairs of data $(X_i, Y_i)$,
for $i = 1, 2, \ldots, n$, we can regard the data as generated from the model

$$Y = m(X) + \sigma(X)\varepsilon$$

where $E(\varepsilon) = 0$, $Var(\varepsilon) = 1$, and X and $\varepsilon$ are independent.

We assume that for every successive pair of two time points in DTS, $t_{i+1} - t_i = f(t)$
is a function (in most cases, $f(t)$ = constant). For every three successive observations
$X_j$, $X_{j+1}$ and $X_{j+2}$, the triple $(Y_j, Y_{j+1}, Y_{j+2})$ has only nine distinct states
(called local features) depending on changes in value.

Let $S_s$ be the same state as the prior one, $S_u$ the go-up state compared with
the prior one and $S_d$ the go-down state compared with the prior one. Then we have the
state space

$$S = \{s1, s2, s3, s4, s5, s6, s7, s8, s9\} = \{(Y_j, S_u, S_u), (Y_j, S_u, S_s), (Y_j, S_u, S_d), (Y_j, S_s, S_u), (Y_j, S_s, S_s), (Y_j, S_s, S_d), (Y_j, S_d, S_u), (Y_j, S_d, S_s), (Y_j, S_d, S_d)\}.$$
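
The nine local features can be computed directly from consecutive differences. The sketch assumes that s1 to s9 follow the order in which the states are listed in the text, and it uses the letters u, s and d for the go-up, same and go-down states; `local_features` is a hypothetical name.

```python
def change(diff):
    """Classify one step as go-up (u), same (s) or go-down (d)."""
    return "u" if diff > 0 else "d" if diff < 0 else "s"

# Order of the nine states as listed in the text: uu, us, ud, su, ss, sd, du, ds, dd.
ORDER = ["uu", "us", "ud", "su", "ss", "sd", "du", "ds", "dd"]
STATE = {pair: f"s{i + 1}" for i, pair in enumerate(ORDER)}

def local_features(y):
    """Map every triple (y[j], y[j+1], y[j+2]) to one of the states s1..s9."""
    return [STATE[change(y[j + 1] - y[j]) + change(y[j + 2] - y[j + 1])]
            for j in range(len(y) - 2)]

print(local_features([10, 12, 12, 11, 13]))
```

A series of length n therefore yields a structural sequence of length n - 2.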

A sequence is called a full periodic sequence if every point in time contributes 
(precisely or approximately) to the cyclic behavior of the overall time series (that is, 
there are cyclic patterns with the same or different periods of repetition). 

A sequence is called a partial periodic sequence if the behavior of the sequence is 
periodic at some but not all points in the time series. 

Definition 2 Let $h = \{h_1, h_2, \ldots\}$ be a sequence. If for every $h_j \in h$, $h_j \in S$, then
the sequence h is called a Structural Base sequence and a subsequence of h is called
a sub-Structural Base sequence. If any subsequence $h_{sub}$ of h is a periodic sequence,
then $h_{sub}$ is called a sub-structural periodic sequence, and h is also a structural periodic
sequence (existence of periodic pattern(s)).

Definition 3 Let $y = \{y_1, y_2, \ldots\}$ be a real-valued sequence. Then y is called a value-point
process. For $y_j$ with $0 \le y_j < 1$ (mod 1) for all j, we say that y is uniformly
distributed if every subinterval of [0, 1] gets its fair share of the terms of the sequence
in the long run.

Definition 4 Let $y = \{y_1, y_2, \ldots\}$ be a sequence of real numbers with $l - \delta < y_k < l + \delta$
for all k, where l is a constant and $\delta$ is an allowable variable parameter. We say
that y has an approximate constant sequence distribution of $y = \{l, l, \ldots\}$. In general,
if $h(t) - \delta < y_k < h(t) + \delta$ for all k, we say that y has an approximate distribution
function h(t).



2.2 Hidden Markov Models (HMMs) 

In a hidden Markov model (HMM) an underlying and unobserved sequence of states
follows a Markov chain with a finite state space, and the probability distribution of the
observation at any time is determined only by the current state of that Markov chain.
In this subsection we briefly introduce hidden Markov time series models; the presentation is
limited to standard results taken from the literature. We have in particular used those of
Baldi and Brunak.

Let $\{S_t : t \in \mathbb{N}\}$ be an irreducible homogeneous Markov chain on the state space
$\{1, 2, \ldots, m\}$, with transition probability matrix A. That is, $A = (\eta_{ij})$, where for all
states i and j, and times t:

$$\eta_{ij} = P(S_t = j \mid S_{t-1} = i)$$

For $\{S_t\}$, there exists a unique, strictly positive, stationary distribution $\gamma = (\gamma_1, \ldots, \gamma_m)$,
where we suppose $\{S_t\}$ is stationary, so that $\gamma$ is, for all t, the distribution of $S_t$.

Suppose there exists a nonnegative random process $\{\xi_t : t \in \mathbb{N}\}$ such that, conditional
on $S^{(T)} = \{S_t : t = 1, \ldots, T\}$, the random variables $\{\xi_t : t = 1, \ldots, T\}$ are
mutually independent and, if $S_t = i$, $\xi_t$ takes the value v with probability $\pi_{t,vi}$. That is,
for $t = 1, \ldots, T$, the distribution of $\xi_t$ conditional on $S^{(T)}$ is given by

$$P(\xi_t = v \mid S_t = i) = \pi_{t,vi}$$

where the probabilities $\pi_{t,vi}$ are called the “state-dependent probabilities”. If the probabilities
$\pi_{t,vi}$ do not depend on t, the subscript t will be omitted.
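
These definitions can be made concrete with a two-state example. The sketch assumes time-independent state-dependent probabilities, approximates the stationary distribution by power iteration, and evaluates the probability of an observation sequence with the standard forward recursion; the matrices and the names `stationary` and `likelihood` are invented for illustration.

```python
def stationary(A, iters=1000):
    """Approximate the stationary distribution of transition matrix A
    by repeatedly applying gamma <- gamma A."""
    m = len(A)
    gamma = [1.0 / m] * m
    for _ in range(iters):
        gamma = [sum(gamma[i] * A[i][j] for i in range(m)) for j in range(m)]
    return gamma

def likelihood(obs, A, pi, gamma):
    """Forward recursion for P(obs); pi[i][v] = P(observation v | state i)."""
    m = len(A)
    alpha = [gamma[i] * pi[i][obs[0]] for i in range(m)]
    for v in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(m)) * pi[j][v]
                 for j in range(m)]
    return sum(alpha)

A = [[0.9, 0.1], [0.5, 0.5]]      # hypothetical transition probabilities
pi = [[0.8, 0.2], [0.3, 0.7]]     # hypothetical state-dependent probabilities
gamma = stationary(A)
print(gamma)
print(likelihood([0, 1, 1], A, pi, gamma))
```

For this matrix the stationary distribution is (5/6, 1/6), and the probabilities of all observation sequences of a fixed length sum to one.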

2.3 Local Polynomial Models (LPMs) 

The key idea of local modelling is explained in the context of least squares regression
models. We use standard results from local polynomial analysis theory, which can be
found in the literature on linear polynomial analysis. Recall the data model
given earlier: $Y = m(X) + \sigma(X)\varepsilon$, where $E(\varepsilon) = 0$, $Var(\varepsilon) = 1$, and X and $\varepsilon$
are independent¹. We approximate the unknown regression function $m(x)$ locally by a
polynomial of order p in a neighbourhood of $x_0$:

$$m(x) \approx m(x_0) + m'(x_0)(x - x_0) + \cdots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p.$$

This polynomial is fitted locally by a weighted least squares regression problem:

$$\text{minimize} \; \sum_{i=1}^{n} \Big\{ Y_i - \sum_{j=0}^{p} \beta_j (X_i - x_0)^j \Big\}^2 K_\delta(X_i - x_0),$$

where $\delta$ is the same $\delta$ as in Definition 4, and $K_\delta(\cdot)$, with K a kernel function, assigns
weights to each datum point².

¹ We always denote the conditional variance of Y given $X = x_0$ by $\sigma^2(x_0)$ and the density of
X by $f(\cdot)$.



3 Hidden Markov-Local Polynomial Models (HMLPMs) 

A real-world temporal dataset may contain different kinds of patterns such as complete
and partial similarity patterns and periodicity patterns, and complete or partial different
order patterns. There are many different techniques for efficient sequence or subsequence
matching to find patterns in a discrete-valued time series database (DTSB). A
limitation of those techniques is that they do not provide a coherent language for
expressing prior knowledge and handling uncertainty in the matching process. Also, the
existence of different patterns does not guarantee the existence of an explicit model.

In this section we introduce our new data mining model for pattern analysis in a
DTS by a combination of hidden Markov models (HMMs) and local polynomial
models (LPMs), called hidden Markov-local polynomial models (HMLPMs). HMMs
have been successfully used in many applications, such as isolated word recognition,
but they have two major limitations. One is that HMMs often have a large number
of unstructured parameters, and the other is that they cannot express dependencies between
hidden states. In order to overcome the limitations of HMMs we apply local polynomial
modelling techniques to relax the restrictive form of an HMM. We combine HMMs and
LPMs to form hybrid models that contain the expressive power of artificial LPMs with
the sequential time series aspect of HMMs.

For building up our new data mining model we divide the data sequence or data 
vector sequence into two groups: (1) the structural-base data group and (2) the pure 
value-based data group. In group one we only consider the data sequence as a 9-state 
structural sequence by applying a distance measure function for performing structural 
pattern search. In group two, we use local polynomial techniques on the pure value-based 
sequence data for discovering pure value-based patterns. Then we combine those two 
groups by using hidden Markov models to obtain the final results. 

3.1 Modelling DTS 

Without loss of generality we assume that for each successive pair of time points in a
DTS, we have $t_{i+1} - t_i = c$ (a unit constant). According to our method, the structural base
sequence and value-point process data model become:

$$U = m(V) + \sigma(V)\varepsilon$$

where U is the number of $y_j$ of a given sample sequence.



² In Section 4, we choose the Epanechnikov kernel function $K(z) = \frac{3}{4}(1 - z^2)$ for our experiments
in pure-value pattern searching.



328 



W. Lin, M.A. Orgun, and G.J. Williams 



Firstly we may view the structural base as a set of vector sequences $\{V_1, \ldots, V_m\}$,
where each $V_i = (s1, s2, s3, s4, s5, s6, s7, s8, s9)^T$ denotes the 9-dimensional
observation on an object that is to be assigned to a prespecified group.

Then we may also view the value-point process model as a local polynomial model:

$$y(x) = \beta_0 + \beta_1 (x - x_0) + \cdots + \beta_p (x - x_0)^p + \varepsilon.$$

It is more convenient to work with matrix notation for the solution to the above least
squares problem in Section 2.3. Let

$$X = \begin{pmatrix} 1 & (X_1 - x_0) & \cdots & (X_1 - x_0)^p \\ \vdots & \vdots & & \vdots \\ 1 & (X_n - x_0) & \cdots & (X_n - x_0)^p \end{pmatrix}$$

and put $Y = (Y_1, \ldots, Y_n)^T$ and $\beta = (\beta_0, \ldots, \beta_p)^T$.

Further, let W be the $n \times n$ diagonal matrix of the weights:

$$W = \mathrm{diag}\{K_\delta(X_i - x_0)\}.$$

The solution vector is provided by weighted least squares theory and is given by

$$\hat{\beta} = (X^T W X)^{-1} X^T W Y.$$

Then the problem of value-point pattern discovery can be formulated as the local
polynomial analysis of discrete-valued time series.
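
The weighted least squares solution can be computed without any external library for small p. The sketch below assumes the usual scaling $K_\delta(z) = K(z/\delta)/\delta$ and an Epanechnikov kernel truncated to $|z| \le 1$; it forms the normal equations and solves them by Gauss-Jordan elimination. The function names are hypothetical, and no safeguard is included for windows with too few points (a singular system).

```python
def epanechnikov(z):
    """Epanechnikov kernel, zero outside [-1, 1]."""
    return 0.75 * (1 - z * z) if abs(z) <= 1 else 0.0

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def local_poly_fit(xs, ys, x0, p, delta):
    """Return the weighted least squares estimate of (beta_0, ..., beta_p) at x0."""
    w = [epanechnikov((x - x0) / delta) / delta for x in xs]
    rows = [[(x - x0) ** j for j in range(p + 1)] for x in xs]   # design matrix X
    xtwx = [[sum(wi * r[a] * r[b] for wi, r in zip(w, rows))
             for b in range(p + 1)] for a in range(p + 1)]
    xtwy = [sum(wi * r[a] * y for wi, r, y in zip(w, rows, ys))
            for a in range(p + 1)]
    return solve(xtwx, xtwy)

xs = [i / 10 for i in range(21)]
ys = [2 + 3 * x + x * x for x in xs]          # noise-free quadratic
beta = local_poly_fit(xs, ys, x0=1.0, p=2, delta=0.5)
print(beta)
```

On noise-free quadratic data the fit at $x_0 = 1$ recovers $m(x_0) = 6$, $m'(x_0) = 5$ and $m''(x_0)/2! = 1$ up to rounding, which matches the interpretation of $\beta_j$ as $m^{(j)}(x_0)/j!$.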



3.2 Structural Pattern Discovery 

We now introduce an approach to discovering patterns in structural base sequences which 
uses a distance measure function with its density estimator. 

From the point of view of our method in structural sequence data analysis, we
use squared distance functions which are provided by a class of positive semidefinite
quadratic forms. Specifically, if $u = (u_1, u_2, \ldots, u_9)$ denotes the 9-dimensional
observation of each different distance of patterns in a state on an object that is to be assigned
to one of the g prespecified groups, then, for measuring the squared distance between u
and the centroid of the ith group, we can consider the function

$$D^2(i) = (u - \bar{y}_i)' M (u - \bar{y}_i)$$

where $\bar{y}_i$ is the centroid of the ith group and M is a positive semidefinite matrix, which ensures that $D^2(i) \ge 0$.
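
The quadratic form is straightforward to evaluate. In the sketch below M is taken to be the identity matrix, which reduces the measure to the squared Euclidean distance; that choice, the two centroids and the function names are assumptions made for the example.

```python
def squared_distance(u, centroid, M):
    """Evaluate (u - centroid)' M (u - centroid)."""
    d = [a - b for a, b in zip(u, centroid)]
    n = len(d)
    return sum(d[r] * M[r][c] * d[c] for r in range(n) for c in range(n))

def nearest_group(u, centroids, M):
    """Index of the group whose centroid is closest to u."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(u, centroids[i], M))

identity = [[1.0 if r == c else 0.0 for c in range(9)] for r in range(9)]
u = [1, 2, 0, 0, 0, 0, 0, 0, 0]
centroids = [[0] * 9, [1, 2, 0, 0, 0, 0, 0, 0, 1]]
print(squared_distance(u, centroids[0], identity))
print(nearest_group(u, centroids, identity))
```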

3.3 Point-Value Pattern Discovery

Here we introduce an enhancement to the local polynomial modelling approach through
functional data analysis. For value-point pattern discovery, given the bivariate data
$(X_1, Y_1), \ldots, (X_n, Y_n)$, one can replace the weighted least squares regression function
in Section 2.3 by

$$\sum_{i=1}^{n} \ell\Big\{ Y_i - \sum_{j=0}^{p} \beta_j (X_i - x_0)^j \Big\} K_h(X_i - x_0)$$

where $\ell(\cdot)$ is a loss function. For the purpose of predicting future values we use a special
case of the above function with $\ell_\alpha(t) = |t| + (2\alpha - 1)t$³.
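
The effect of the loss $\ell_\alpha$ is easiest to see in the simplest case of fitting a constant without kernel weights. The sketch below is that special case only, not the full local fit; it uses the fact that the minimum of this piecewise-linear convex criterion is attained at one of the data values. The names `check_loss` and `best_constant` are hypothetical.

```python
def check_loss(t, alpha):
    """The loss l_alpha(t) = |t| + (2*alpha - 1) * t."""
    return abs(t) + (2 * alpha - 1) * t

def best_constant(ys, alpha):
    """Data value c minimising the total loss of the residuals y - c."""
    return min(sorted(ys), key=lambda c: sum(check_loss(y - c, alpha) for y in ys))

ys = [1, 2, 3, 4, 100]
print(best_constant(ys, 0.5))   # alpha = 0.5 gives the median
print(best_constant(ys, 0.9))   # a high alpha moves the fit towards large values
```

With $\alpha = 0.5$ the loss is the absolute error and the minimiser is the median, which is insensitive to the outlier 100; larger $\alpha$ penalises positive residuals more heavily and pushes the fitted value upwards.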

3.4 Using HMLPMs for Pattern Discovery 

For using HMLPMs in pattern discovery we combine the above two kinds of pattern 
discovery. In structural pattern searching let the structural sequence {Vt : f G N} be 
an irreducible homogeneous Markov chain on the state space {si, s2, . . . , s9}, with the 
transition probability matrix A (see section 2.1 for details). 

In value-point pattern searching suppose the pure valued data sequence is a non- 
negative random process (C^; tG N} such that, conditional on = {Vt : t — 
1, ..., T}, the random variables {Ct : t — 1, ..., T} are mutually independent and, 
if St = i, Ct takes the value v with probability ttL. That is, for t = 1, . . . , T, the 
distribution of Ct conditional on is given by 

P{Ct = v\Vt = i) = ttS 

Suppose that if Vt = i, ^t has a local polynomial distribution with parameters Up^t (a 
known positive integer) and pi. That is, the conditional local polynomial distribution of 
has parameters Up^t and m{t), where 

m 

m{t) = 

i=l 

and Wi{t) is, as before, the indicator of the event {Vt = i}. Then we have “state- 
dependent probabilities” for each nine states (n = 0,l,...,ripf) 

The models $\{C_t\}$ are defined as hidden Markov-local polynomial models. In this case
there are $m^2$ parameters: m parameters $\lambda_i$ or $p_i$, and $m^2 - m$ transition probabilities
$\eta_{ij}$, e.g. the off-diagonal elements of A, to specify the “hidden Markov chain” $\{S_t\}$.

4 Experimental Results 

This section presents selected experimental results. There are three steps of experiments
for the investigation of the “Daily Foreign Exchange Rates” data, that is, the analysis of “Exchange Rates
Patterns” between the U. S. dollar and the U. K. pound. The data consist of the daily exchange
rate for each business day between 2 January 1971 and 21 June 1999. The time series is
plotted in Figure 1.

^ This is often called quantile regression. 

The Federal Reserve Bank of New York for trade weighted value of the dollar = index 
of weighted average exchange value of U.S. dollar against the United Kingdom Pound: 
http://www.frbchi.org/econinfo/finance/finance.html.
Fig. 1. 5764 working days exchange rates between the U. S. dollar and the U. K. pound, since
1971.

4.1 On Structural Pattern Searching 

We investigate the sample of the structural base to test naturalness of the similarity and
periodicity on the Structural Base distribution. The size of this discrete-valued time series
is about 5764 points. We consider 9 states in the state-space of structural distribution:
$S = \{s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8, s_9\}$.




Fig. 2. Left: plot of the distance between same state for all 9 states in 5764 business days. Right: 
plot of the distance between same state for all 9 states in first 300 business days. 



In Figure 2 each point represents the occurrence of one of the nine transition states,
retaining the original order of the states. There exist two approximately uniform distributions,
on state 3 and state 7, if the number of observations is large enough. Figure 2 also explains
two facts: (1) there exists a hidden periodic distribution which corresponds to patterns
on the same line with different distances, and (2) there exist partial periodic patterns on
and between the same lines. To explain this further, we can look at the plot of distances
between the patterns at a finer granularity over a selected portion of the daily exchange
rates. For instance, in the right of Figure 2 the dataset consists of daily exchange rates
for 300 business days starting from 3 January 1983. It shows that there exist a number of
partial periodic patterns appearing in each year, and that in each state in a
year there is a hidden periodic and similarity distribution, with each point representing
the distance of patterns of various forms. Between some combined pattern classes there
exist similar patterns, such as between 5 to 10 and 15 to 22, and between 32 to 35 and 42 to
44.

In Figure 3 the x-axis represents how many times the same distance is found between
repeating patterns and the y-axis represents the distance between the first and second
occurrences of each repeating pattern. In other words, we classify repeating patterns
based on a distance classification technique. Again we can look at the plot over a selected
portion to observe the distribution of distances in more detail. For example, in the right of
Figure 3 the dataset consists of daily exchange rates for the first 50 business days. It can be
observed that the distribution of distances is a cubic curve distribution: u = — , , 

where A = ax^ H- bx H- c and Z\, b < 0, a > 0. 
Fig. 3. Left: plot of the distance between same state for all states in 5764 business days. Right:
plot of the different patterns appearing at different distances for the first 50 business days.



In summary, some results for the structural base experiments are as follows: 

- Structural distribution is a hidden periodic distribution with a periodic length function
$f(t)$ (there are techniques available to approximate the form of this function,
such as higher-order polynomial functions).

- There exist some partial periodic patterns based on a distance shifting. 

- For all kinds of distance functions there exist a cubic curve: y = where 

A = ax^ H- bx -H c and Z\, b < 0, a > 0. 

- There exists an approximately uniform distribution in state 3 and state 7.

4.2 On Value-Point Pattern Searching 

We now illustrate our new method to construct predictive intervals on the value-point
sequence for searching periodic and similarity patterns. The linear regression of the value-point
$X_t$ against $X_{t-1}$ explains about 99% of the variability of the data sequence, but
it does not help us much in analysing and predicting future exchange rates. In the light
of our structural base experiments, we have found that the series $Y_t = X_t - X_{t-2}$ has
non-trivial autocorrelation. The correlation between $Y_t$ and $Y_{t-1}$ is 0.5268. Then the
observations can be modelled as a polynomial regression function, say

$$Y_t = X_t - X_{t-2} + \sigma(X_t)\varepsilon_t, \quad t = 1, 2, \ldots, N$$

and then the following new series

$$y(t) = Y(t) + Y(t-1) + \varepsilon_{t'}, \quad t = 1, 2, \ldots, N$$

may be obtained. We also consider $\varepsilon(t)$ as an autoregressive AR(2) model

$$\varepsilon_{t'} = a\varepsilon_{t'-1} + b\varepsilon_{t'-2} + e_{t'}$$

where a, b are constants dependent on the sample dataset, and $e_{t'}$ has a small constant
variance, which can be used to improve the predictive equation. Our analysis is focused
on the series $Y_t$, which is presented in the left of Figure 4. It is a scatter plot of lag 2
differences: $Y_t$ against $Y_{t-1}$.
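
The lag-2 differencing and the lag-1 correlation used above can be sketched as follows. The function names are hypothetical and the input in the usage note is a toy sequence, not the exchange rate data.

```python
from math import sqrt


def lag_difference(x, lag=2):
    """Return the series x[t] - x[t - lag] for every t where it is defined."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


def lag_correlation(y, lag=1):
    """Pearson correlation between y[t] and y[t - lag]."""
    a, b = y[lag:], y[:-lag]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)
```

For the toy input `[1, 2, 4, 7, 11, 16]` the lag-2 differences are `[3, 5, 7, 9]`, which are perfectly linearly related to their own lag-1 values.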

We obtain the exchange rates model according to nonparametric quantile regression
theory:

$$Y_t = 0.488 Y_{t-1} + \varepsilon_t$$

From the distribution of $\varepsilon_t$, $\varepsilon(t)$ can be modelled as an AR(2)

$$\varepsilon_t = 0.261\varepsilon_{t-1} - 0.386\varepsilon_{t-2} + e_t$$

with a small $\mathrm{Var}(e_t)$ (about 0.00093) to improve the predictive equation.

For prediction of future exchange rates for the next 210 business days, we use the
simple equation $Y_t^* = 0.488 Y_{t-1}$ with an average error of 0.00135. In the right of Figure 4
the actually observed series and predicted series are shown.
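
The prediction step can be sketched as below. The coefficients 0.488, 0.261 and -0.386 are the ones reported in the text. The fitting routine is an ordinary least squares slope through the origin, used here only as a simple stand-in for the nonparametric quantile regression of the paper, and all function names are hypothetical.

```python
def fit_through_origin(y):
    """Least squares slope c in y[t] ~ c * y[t-1] (stand-in, not quantile regression)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den


def predict(y_prev, coef=0.488):
    """One-step prediction Y*_t = coef * Y_{t-1}."""
    return coef * y_prev


def ar2_noise_forecast(eps_prev1, eps_prev2, a=0.261, b=-0.386):
    """One-step AR(2) forecast of the noise term, with the innovation set to zero."""
    return a * eps_prev1 + b * eps_prev2


def mean_abs_error(actual, predicted):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(p - q) for p, q in zip(actual, predicted)) / len(actual)
```

A refined forecast could add `ar2_noise_forecast` to `predict`; whether the reported average error of 0.00135 was computed with or without such a correction is not stated in the text.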

Some results for the value-point experiments are as follows:

- There does not exist any full periodic pattern, but there exist some partial periodic 
patterns based on a distance shifting. 

- There exist some similarity patterns with a small distance shifting. 



4.3 Using HMLPMs for Pattern Searching 

Let $\{S_t : S_t \in S, t \in N\}$ be an irreducible homogeneous Markov chain on the state
space $\{s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8, s_9\}$, with transition probability matrix (TPM) (or,
stochastic matrix) A:



A =

    0.5186  0.0208  0.4606  0       0       0       0       0       0
    0       0       0       0.5161  0.0323  0.4516  0       0       0
    0       0       0       0       0       0       0.5146  0.0362  0.4492
    0.4118  0.0588  0.5294  0       0       0       0       0       0
    0       0       0       1       0       0       0       0       0
    0       0       0       0       0       0       0.5     0       0.5
    0.5355  0.0260  0.4385  0       0       0       0       0       0
    0       0       0       0.4359  0       0.5641  0       0       0
    0       0       0       0       0       0       0.4962  0.0342  0.4696
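
A transition probability matrix of this kind can be estimated from an observed state sequence by counting transitions and normalising each row. The sketch below is one common way to do this and is not necessarily the procedure used by the authors; the function name and the integer coding of states (0 to n_states - 1) are assumptions.

```python
def estimate_tpm(states, n_states):
    """Row-normalised transition counts for a sequence of integer-coded states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    tpm = []
    for row in counts:
        total = sum(row)
        # A state that is never left in the sample gets an all-zero row.
        tpm.append([c / total if total else 0.0 for c in row])
    return tpm
```

Every row that corresponds to a visited state sums to 1, which is also a quick consistency check for a matrix such as A.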
Fig. 4. Left: Scatter plot of lag 2 differences: $Y_t$ against $Y_{t-1}$. Right: Plot of future exchange rates
only for 210 business days by using the simple equation $Y_t = 0.488 Y_{t-1}$.



We are interested in the future distribution of the TPM, $f(t) = A^t$. According to the
Markov property, the TPM satisfies $\lim_{t \to \infty} A^t = 0$. This means that the TPM is non-recurrent
from a state $s_i$ to a state $s_j$. In other words, we cannot use the present exchange rate to predict
the exchange rate some period later; we are only able to predict the near-future
exchange rate.

Suppose that our prediction of the future exchange rate of the value-point sequence is a
nonnegative random process $\{C_t, t \in N\}$, and satisfies $Z_t = aZ_{t-1} + \theta_t$.

Suppose the distribution of the sequence of transition probability matrices (TPM) under
time order $\Delta_1, \Delta_2, \ldots, \Delta_t$, $t \in N$, corresponding to the prediction value-point $Z_t =
Y_t - Y_{t-2}$.

The main combined results on exchange rates are as follows:

- We are only able to predict a short future period by using all present information. 

- There does not exist any full periodic pattern but there exist some partial periodic 
patterns. 

- There exist some similarity patterns with a small distance shifting. 



5 Related Work 

According to pattern theory objectives in pattern searching can be classified into three 
categories: 

- Create a representation in terms of algebraic systems with probabilistic superstruc- 
tures intended for the representation and understanding of patterns in nature and 
science. 

- Analyse the regular structures from the perspectives of mathematical theory. 

- Apply regular structures to particular applications and implement the structures by 
algorithms and code. 
In recent years various studies have considered temporal datasets for searching different
kinds of and/or different levels of patterns. These studies have only covered one
or two of the above categories. For example, many researchers use statistical techniques
such as metric-distance based techniques, model-based techniques, or a combination
of techniques (e.g., fHln) to search for different pattern problems such as in periodic
patterns searching (e.g., 0 ) and similarity pattern searching (e.g., 0 |).

Some studies have covered the above three categories for searching patterns in data
mining. For instance (1 presents a “shape definition language”, called SDL, for retrieving
objects based on shapes contained in the histories associated with these objects.
Also liT^ present a logic algorithm for finding and representing hidden patterns. In |0,
the authors described adaptive methods which are based on similar methods for finding rules
and discovering local patterns.

Our work is different from these works. First, we use a statistical language to perform 
all the search work. Second, we divide the data sequence or, data vector sequence, into 
two groups: one is the structural base group and the other is the pure value based group. 
In group one our techniques are similar to Agrawal’s work but we only consider three
state changes (i.e., up (value increases), down (value decreases) and same (no change))
whereas Agrawal considers eight state changes (i.e., up (slightly increasing value), Up
(highly increasing value), down (slightly decreasing value) and so on). In this group, we
also use distance measuring functions on structural based sequences, which is similar to
m- In group two we apply statistical techniques such as local polynomial modelling
to deal with pure data, which is similar to ||5|. Finally, our work combines significant
information from the two groups to get the global information which is behind the dataset.



6 Concluding Remarks 

This paper has presented a new approach combining hidden Markov models and local 
polynomial analysis to form new models of application of data mining. The rough deci- 
sion for pattern discovery comes from the structural level that is a collection of certain 
predefined similarity patterns. The clusters of similarity patterns are computed in this 
level by the choice of certain distance measures. The point-value patterns are decided 
in the second level and the similarity and periodicity of a DTS are extracted. In the 
final level we combine structural and value-point pattern searching into the HMLPM 
model to obtain a global pattern picture and understand the patterns in a dataset better. 
Another approach to find similar and periodic patterns has been reported elsewhere
USUI- There the model used is based on hidden periodicity analysis and local
polynomial analysis. However, we have found that using different models at different
levels produces better results.

The “Daily Foreign Exchange Rates” data was used to find the similar patterns and
periodicities. The existence of similarity and partially periodic patterns is observed
even though there is no clear full periodicity in this analysis.

The method guarantees finding different patterns if they exist with structural and 
valued probability distribution of a real-dataset. The results of preliminary experiments 
are promising and we are currently applying the method to large realistic data sets such 
as two kinds of diabetes dataset. 
Abstract. Complete or partial periodicity search in time-series
databases is an interesting data mining problem. Most previous studies
on finding periodic or partial periodic patterns focused on data structures
and computing issues. Analysis of long-term or short-term trends
over different time windows is of great interest. This paper presents a
new approach to discovery of periodic patterns from time-series with
trends based on time-series decomposition. First, we decompose time
series into three components: seasonal, trend and noise. Second, with
an existing partial periodicity search algorithm, we search either partial
periodic patterns from trends without seasonal component or partial
periodic patterns for seasonal components. Different patterns from any
combination of the three decomposed time-series can be found using this
approach. Examples show that our approach is more flexible and suitable
to mine periodic patterns from time-series with trends than the previously
reported methods.



1 Introduction 

Time-series data is frequently encountered in real world applications. Stock price, 
economic growth and weather forecast are typical examples. A time-series rep- 
resents a set of consecutive observations or measurements taken at certain time 
intervals. The observations can be real numerical values, for instance, a stock 
price, or categories, for instance, different medicines taken by a patient over a 
treatment period. The continuous values of a time-series can be converted to 
categories if needed. 

Discovery of interesting patterns from a large number of time-series is re- 
quired in many applications. In disease management, for example, the patterns 
of the best treatment on diabetes in different age groups and genders can be 
discovered from the health insurance time-series data |H|. Such information is
extremely useful to the government in developing sound health policies. In astronomical
science, the microlensing event patterns are hidden in millions of
time series measured from 20 million stars over many years m- The discovery
of such patterns provides strong evidence for the existence of “dark matter”
in the universe.

One of the tasks for time-series data mining is search for periodic patterns 
m- The problem is interesting because many patterns in time-series data are 
periodic or partially periodic. Han et al recently described some methods for 
periodic pattern discovery from time-series with observations as categories m- 
When their methods are applied to time-series with observations as continuous 
values, a conversion of values to categories is needed. Such conversion can be 
easily done on the time-series without trends and their methods are directly 
applicable to the converted time series. However, problems occur when the time-series
have trends because some hidden cyclic patterns may be converted to
different categories due to the trend effect. Direct application of Han et al.’s
methods to these time-series will miss many partially periodic patterns. In
fact, many time-series in the real world have trends. For example, the stock 
value of an equity fund is increasing in the long run although its price goes 
up and down over time. Our motivation in this work is to enhance Han et al’s 
methods m and make them applicable to the time-series with trends. 

In this paper, we present a method to discover periodic patterns from time- 
series with trends. We first use a time-series decomposition technique to decom- 
pose time-series with trends into three components, the seasonal component, 
the trend and the noise. Then, we apply the algorithms in UM to the seasonal 
time-series to discover the periodic patterns. After that, we divide the trend
component into different categories of regions, such as “increasing”, “flat” and
“decreasing”. Finally, we combine the periodic patterns with the trend categories
to form conditional patterns. For example, pattern C(p,q,s) occurs when
the trend is “decreasing”. Here p is the period of the pattern, q is the offset 
indicating the first time stamp at which the pattern occurs and s is the number 
of cycles of the pattern. 

Our contribution in this paper is to introduce the time-series decomposi- 
tion technique, in particular, the STL method, into the process of discovery of 
periodic patterns from time-series with trends. Our approach has two major 
advantages: (1) we can discover periodic patterns from time-series with trends, 
while the previous methods will miss some partially periodic patterns in the areas 
where the trend increases or decreases significantly, and (2) we discover periodic 
patterns with trend conditions which can reveal more useful information about 
the patterns. For example, pattern A occurs when the trend increases. However, 
such information is not given in the previous approach. For time-series without 
trends, our approach is similar to those in PE! except that we consider noise. 
In this sense, our patterns are noise-resistant. 
1.1 Related Work

Most of the reported work concentrated on symbolic patterns. In P, an Apriori- 
like technique was developed for mining sequential patterns. In Cl, the authors 
studied frequent episodes in sequences, where episodes are essentially acyclic 
graphs of events whose edges specify the temporal before-and-after relation- 
ship but without timing-interval restrictions. In P), a generalization of inter- 
transaction rules was studied where the left hand and right-hand sides of a 
rule are episodes with time-interval restrictions. In multi-dimensional inter- 
transaction rule mining was studied. The mining of cyclic association rules con- 
sidered the mining of some patterns of a range of possible periods m Because 
cyclic association rules are partial periodic patterns with perfect periodicity in 
the sense that each pattern re-occurs in every cycle with 100% confidence, partial
periodic patterns were studied in m- Recently, in |0], the authors studied
how to identify representative trends in massive time-series data sets using
sketches. Informally, an interval of observations in a time-series is defined as
a representative trend if its distance from other intervals satisfies certain properties,
for suitably defined distance functions between time-series intervals. The
problems with mm are that they mine patterns based on what appears in
the time-series data as they are. They do not mine patterns based on the global
cycle-trends, seasonal and irregular patterns, as a whole and individually.

1.2 Paper Organization 

The definitions of periodic patterns in time-series data are given in Section 2. In
Section 3, we present the time series decomposition techniques and describe the
STL approach for time series decomposition in detail. In Section 4 we discuss
the periodic pattern discovery methods from decomposed time-series and show
some real examples. Our concluding remarks on this work are given in Section 5.



2 Periodic Patterns in Time-Series Database 

A time-series is a set of observations Xt, each one being recorded at a specific 
time stamp t. A discrete-time time-series is one in which the set of times at 
which observations are made is a discrete set, as is the case for example when 
observations are made at fixed time intervals. In this paper, we consider a time- 
series of real numbers: 

$$x_1, x_2, \ldots, x_n, \quad x_i \in R.$$

For any time-series, the time stamps $t_1$ and $t_2$ are called similar if their time-related
values at these two time stamps are the same, i.e., $x_{t_1} = x_{t_2}$. We follow
the similar definitions given in [TWI and define a periodic pattern as follows.

Definition 1. For any given time-series if there exist positive integers
p and q (0 < p, q < n) and s (0 < s < n/p) such that all the time stamps $pr + q$
are similar for $0 \le r \le s - 1$, we call

$$\{x_q, x_{p+q}, x_{2p+q}, \ldots, x_{(s-1)p+q}\}$$
a cycle, denoted by C = (p, q, s). Here p is the length or period of the cycle, q is
the offset indicating the first time stamp at which the cycle occurs and s is the
number of time stamps which are similar in the cycle.

If there are m cycles with the same period p, then these m cycles form a
partial periodic pattern with period p. However, if the number of cycles equals
the pattern length, then a complete periodic pattern is formed.
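
Definition 1 can be checked directly on a quantified sequence. The sketch below is a straightforward reading of the definition, with time stamps counted from 1 as in the text; the function name is hypothetical, and "similar" is taken to mean exactly equal values.

```python
def is_cycle(x, p, q, s):
    """True if time stamps q, p+q, ..., (s-1)p+q (1-based) all carry the same value."""
    if p < 1 or q < 1 or s < 1:
        return False
    stamps = [p * r + q for r in range(s)]
    if stamps[-1] > len(x):
        return False  # the cycle would run past the end of the series
    first = x[stamps[0] - 1]
    return all(x[t - 1] == first for t in stamps)
```

For the sequence (0, 5, 8, 8, 0, 5, 8, 8, 0, 5, 8, 8), `is_cycle(seq, 4, 1, 3)` holds because time stamps 1, 5 and 9 all carry the value 0, while `is_cycle(seq, 3, 1, 3)` does not.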

Definition 2. For any given time-series if there exist m cycles

$$C_1 = (p, q_1, s_1), C_2 = (p, q_2, s_2), \ldots, C_m = (p, q_m, s_m)$$

such that the number of cycles m in a pattern equals the pattern length
$\max_{1 \le i \le m} q_i - \min_{1 \le i \le m} q_i + 1$, then we refer to such a pattern $C_1, C_2, \ldots, C_m$
as a complete periodic pattern, denoted by

$$C(p, \min_{1 \le i \le m}\{q_i\}, \min_{1 \le i \le m}\{s_i\}).$$



Definition 3. A time-series contains a cycle C = (p, q, s) with a confidence $\alpha$
if there are $\alpha \cdot s$ time stamps which are similar. Similarly, a time-series contains
a partial periodic pattern with a confidence $\beta$ if there are $\beta \cdot m$ cycles with the
period p.
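
The test in Definition 2 can be sketched as follows. Cycles are represented as (p, q, s) tuples, the offsets are assumed to be distinct, and the function name is hypothetical.

```python
def complete_pattern(cycles):
    """Return (p, min q, min s) if the cycles form a complete periodic pattern, else None."""
    if not cycles:
        return None
    periods = {p for p, _, _ in cycles}
    if len(periods) != 1:
        return None  # all cycles must share the same period p
    offsets = [q for _, q, _ in cycles]
    pattern_length = max(offsets) - min(offsets) + 1
    if len(cycles) != pattern_length:
        return None  # only a partial periodic pattern
    return (periods.pop(), min(offsets), min(s for _, _, s in cycles))
```

Four cycles (4, 1, 3), (4, 2, 3), (4, 3, 3) and (4, 4, 3) give the complete pattern C(4, 1, 3), whereas the two cycles (4, 1, 3) and (4, 3, 3) span a pattern length of 3 and therefore form only a partial periodic pattern.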

In pi6| . J. Han et al. developed an efficient method for mining periodicity
in time-series databases by using the Apriori mining technique. They also showed
that the data cube structure provides an efficient and effective structure. However,
their algorithm is not effective when applied to a time-series that contains a “trend
component”. For instance, consider the data sequence in Figure 1, which shows
468 monthly observations of atmospheric concentrations of CO2 from 1959 to
1997. The top panel is the original time-series. If we apply their algorithm to
the time series in the top panel, we cannot find any periodic patterns. The main
reason is that there is a hidden trend in the data sequence. Based on these
observations, we propose to identify a “trend component” in a time-series before
we apply a data mining algorithm to find periodic patterns.

3 Time-Series Decomposition 

In this section, we provide some notation and background information for time-series
decomposition mg. Time-series decomposition assumes that the data are
made up of patterns and errors. Typically, there are three patterns, namely,
cyclical pattern, trend pattern and seasonal pattern.

— A cyclical pattern exists when the data exhibit rises and falls that are not 
of a fixed period. 

— A trend pattern exists when there is a long-term increase or decrease in the 
data. 

— A seasonal pattern exists when a series is influenced by seasonal factors. 
Fig. 1. Atmospheric concentrations of CO2: A time-series of 468 observations; monthly
from 1959 to 1997.



An error is the difference between the combined effect of the above trend-cycle- 
seasonality and the actual data. It is often called the irregular or the remain- 
der component. Because the distinction between trend and cycle is somewhat 
artificial, most decomposition procedures treat the trend and cycle as a single 
component called trend-cycle. An example of time-series decomposition is shown
in Figure 1.

The general mathematical representation of a time-series decomposition approach
is given as $x_t = f(y_t, h_t, e_t)$ where $x_t$ is the time series value (actual
data) at period t, $y_t$ is the seasonal component (or index) at period t, $h_t$ is
the trend-cycle component at period t, and $e_t$ is the irregular component at period t.
Two common approaches are additive decomposition and multiplicative
decomposition.

— additive decomposition: $x_t = y_t + h_t + e_t$.

— multiplicative decomposition: $x_t = y_t \times h_t \times e_t$.

In fact, logarithms turn a multiplicative relationship into an additive relation- 
ship. Therefore, it is possible to fit a multiplicative relationship by fitting an 
additive relationship to the logarithms of the data. 
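
The remark about logarithms can be verified numerically. The sketch below assumes strictly positive components, which the logarithm requires; the function name and the sample values are illustrative only.

```python
import math


def log_components(y, h, e):
    """Logs of the seasonal, trend-cycle and irregular factors (all must be > 0)."""
    return math.log(y), math.log(h), math.log(e)


# Multiplicative model: x = y * h * e, so log x = log y + log h + log e.
y, h, e = 1.2, 50.0, 0.98
x = y * h * e
difference = math.log(x) - sum(log_components(y, h, e))
```

Here `difference` is zero up to floating point rounding, so an additive procedure applied to the logarithms of the data fits the multiplicative model.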

The classical time-series decomposition was developed in the 1920s, and
many variants have been developed since m3- The Census II method has been developed
by the U.S. Bureau of the Census. One of the most widely used variants is
X-12-ARIMA (O]. X-12-ARIMA uses shorter weighted moving averages to pro- 
vide estimates for the observations at the beginning and end of the series, and 
provides the facility to extend the original series with forecasts to ensure that 
more of the observations are adjusted using the full weighted moving averages. 
These forecasts are obtained using an ARIMA (Autoregressive Integrated Moving
Average) model. The STL decomposition method, a seasonal-trend decomposition
procedure based on Loess, was proposed in 1990 as an alternative to Census II Pj.
STL consists of a sequence of applications of the Loess smoother to
give a decomposition that is highly resistant to extreme observations. In addition, 
STL is capable of handling seasonal time-series where the length of seasonality 
is other than quarterly or monthly. In fact, any seasonal period n > 1 is allowed. 

In this study, we adopt the STL approach for time-series decomposition. We 
outline the STL procedure following the discussions in [3111)) . The STL procedure 
consists of two loops, namely, an inner loop and an outer loop. The inner loop 
performs six basic steps. 

1. A de-trended series is computed by subtracting the trend estimated from
the original data: $x_t - h_t = y_t + e_t$, where initially $h_t$ is set to be zero.

2. The de-trended values for each point in every window are collected to form 
sub-series. Each of the sub-series is smoothed by a Loess smoother. The 
smoothed sub-series are glued back together to form a preliminary seasonal 
component. 

3. A moving average is applied to the preliminary seasonal component. The re- 
sult is in turn smoothed by a Loess smoother again. The purpose of this step 
is to identify any trend-cycle that may have contaminated the preliminary 
seasonal component in the previous step. 

4. The seasonal component is estimated as the difference between the preliminary
seasonal component of the second step and the smoothed seasonal component
of the third step.

5. A seasonally adjusted series is computed by subtracting the result of the
fourth step from the original data ($x_t - y_t = h_t + e_t$).

6. The seasonally adjusted series is smoothed by Loess to give the trend com- 
ponent ht- 

The outer loop begins with one or two iterations of the inner loop. The
resulting estimates of trend-cycle and seasonal components are then used to
calculate the irregular component: $e_t = x_t - h_t - y_t$. Large values of $e_t$ indicate
an extreme observation. These are identified and a weight is calculated. That
concludes the outer loop. Further iterations of the inner loop use the weights in
the second and sixth steps of the inner loop to downweight the effect
of extreme values. Further iterations of the inner loop begin with the trend
component from the previous iteration.
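
A full STL implementation needs a Loess smoother and robustness weights and is too long to show here. As a much simpler stand-in, the sketch below performs a classical additive decomposition with a centred moving average; it is not STL and has no outer loop, but it produces the same three components $y_t$, $h_t$ and $e_t$. The function name and the assumption that the period is known are illustrative.

```python
def classical_decompose(x, period):
    """Additive decomposition x = seasonal + trend + remainder (not STL).

    The trend is a centred moving average and is None near both ends,
    where the averaging window does not fit.
    """
    n = len(x)
    if n < 2 * period:
        raise ValueError("need at least two full periods")
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        if period % 2:  # odd period: plain centred average
            trend[t] = sum(x[t - half:t + half + 1]) / period
        else:           # even period: half weight on the two end points
            total = (0.5 * x[t - half] + sum(x[t - half + 1:t + half])
                     + 0.5 * x[t + half])
            trend[t] = total / period
    # Seasonal component: mean de-trended value for each phase of the period.
    by_phase = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_phase[t % period].append(x[t] - trend[t])
    means = [sum(v) / len(v) for v in by_phase]
    centre = sum(means) / period  # force the seasonal component to sum to zero
    seasonal = [means[t % period] - centre for t in range(n)]
    remainder = [None if trend[t] is None else x[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return seasonal, trend, remainder
```

For a series built as a linear trend plus an exactly periodic, zero-mean seasonal pattern, this procedure recovers both components at the interior points. STL differs in that it replaces the moving averages by Loess smoothing and downweights extreme observations.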

4 Patterns Discovery from Decomposed Time-Series 

In this section, we make use of the decomposition of time-series to mine periodic
patterns. The outline of our approach is given below.

— Step-1: Decompose the time-series $\{x_t\}_{t=1}^{n}$ into three parts: $x_t = y_t + h_t + e_t$
using the STL procedure as discussed in the previous section.

— Step-2: Mine the periodic patterns for the time-series $\{y_t\}_{t=1}^{n}$, using an
existing approach m-



342 J.X. Yu, M.K. Ng, and J.Z. Huang 



— Step-3: Mine the rules from the combination of the trend from $h_t$, the
periodic patterns from $y_t$ and the error from $e_t$.



Table 1. A time sequence example.

    t     1        2        3        4        5        6
    x_t   1.0000   2.7071   4.0000   4.7071   5.0000   5.2929

    t     7        8        9        10       11       12
    x_t   6.0000   7.2929   9.0000   10.707   12.000   12.7071




Fig. 2. Decomposition for Table 1.



Consider a simple time-series, $\{x_t\}_{t=1}^{12}$, in Table 1 where each $x_t$ is generated
as $(t - 1) + \sin(\pi(t - 1)/4)$. Suppose that we quantify them as (1, 3, 4, 5, 5, 6, 7, 9, 10,
12, 13). Obviously, there are no partial cyclic patterns in this sequence
due to the trend. After STL decomposition, Figure 2 shows the three time-series,
namely, $\{y_t\}_{t=1}^{12}$ (seasonal), $\{h_t\}_{t=1}^{12}$ (trend) and $\{e_t\}_{t=1}^{12}$ (error), where $x_t = y_t + h_t + e_t$.
Suppose that the quantified sequence of $y_t$ is (0, 5, 8, 8, 0, 5, 8, 8, 0, 5, 8, 8). Applying
the partial periodic mining algorithm [Z| to $y_t$, we can identify periodic
patterns, for example, a complete periodic pattern C(4, 1, 3). However, the interpretation
of periodic patterns from the seasonal time-series $y_t$ is different.
As can be seen from Figure 2, the values of the seasonal component are in the range
-5.0 to 3.0. As mentioned by Hyndman m, for additive decomposition, the
seasonal component is added to the trend, and can be positive or negative. The
seasonal component represents the amount to be added on average each season.
If this is negative, it means the season is lower than average. If it is positive, it
means the season is higher than average. Therefore, the patterns we find so far
are based on it under the trend. In addition, we can classify a trend as “slowly
increasing”, “increasing”, “increasing quickly”, “no change”, etc. It is worth noting
that there are possible periodic patterns in a trend where periodic patterns
are for a long term. A periodic pattern C(p, q, s) that appears in $y_t$ with the trend $h_t$
needs to be justified with statistical significance. For instance, the least squares
error for the above time-series $y_t$ is 1.47. In a similar fashion, for
the atmospheric concentrations of CO2 shown in the top panel of Figure 1, there
are no particular periodic patterns existing in the original time-series. However,
after STL decomposition, we find a periodic pattern over years. The least squares
error is 21.27.
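
The trend categories mentioned above can be assigned with a simple slope rule. The sketch below is only one possible reading: the thresholds `eps`, `slow` and `fast` are hypothetical values chosen for illustration, the slope is taken as the average change per time step over the region, and the "decreasing" label covers every negative slope.

```python
def classify_trend(h, eps=0.01, slow=0.1, fast=1.0):
    """Label a trend region h (a list of trend values) by its average slope."""
    if len(h) < 2:
        raise ValueError("need at least two trend values")
    slope = (h[-1] - h[0]) / (len(h) - 1)
    if slope <= -eps:
        return "decreasing"
    if slope < eps:
        return "no change"
    if slope < slow:
        return "slowly increasing"
    if slope < fast:
        return "increasing"
    return "increasing quickly"
```

A periodic pattern C(p, q, s) found in the seasonal component can then be reported together with the label of the trend region in which it occurs, which gives the conditional patterns described in the introduction.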

In general, the following statement can be made on periodic pattern mining 
on the seasonal component. The periodic patterns are found on a time-series, 
yt (seasonal component), with a specific trend and with a statistical significance 
associated like the least squares error. The advantages of our approach are given 
as follows. 

— Without any time-series decomposition, it can find the same partial periodic 
patterns using the algorithm in [7j- 

— After STL time-series decomposition, xt = yt + ht + et, it can further find 
the following patterns. 

— Find partial periodic patterns from $y_t$ under the trend $h_t$ with the least
squares error. It suggests that it is now possible to look at the seasonal
components if that is required.

— Find partial periodic patterns from the seasonally adjusted series $x_t - y_t$. This is
required because many published economic series are seasonally adjusted,
since seasonal variation is typically not of primary
interest in that context.

— Simplify the task of comparing time-series. With the time-series decomposition
approach, we can compare two time-series, $x'_t$ and $x''_t$, based on
their original time-series, the trend component, the seasonal component,
and any meaningful combinations.

However, care must be taken in using the periodic pattern discovery technique
based on time-series decomposition. Figures 3, 4, 5 and 6 show four different
stocks taken over periods of different lengths from the 1997 CRSP US stock
databases, together with their trend components, seasonal components and errors.

In the top panel of Figure 3, we can see some possible periodic patterns in
the original time-series. After decomposition, the periodic patterns in the
seasonal component are more visible. Both the original time-series and the
seasonal component have some partial periodic components, but they differ.
First, the fluctuation is different. Second, the patterns from the seasonal
component are based on the seasonality alone, whereas the patterns from the
original time-series include both the long-term trend and the seasonality.
Third, the patterns from the seasonal component are found under the trend,
with a statistical significance such as the least squares error. After
removing the seasonal component, the trend is much smoother. Some periodic
patterns can be found in the trend component. The trend component helps to
find other stocks with a similar trend and to compare their seasonal
behaviors. As this example shows, it is possible to search for patterns in the
original time-series, its trend component and its seasonal component.

Fig. 3. Stock price for Stock-A. The time dimension is the i-th working day
from Nov. 1, 1988 to Oct. 5, 1990 (5 working days a week), taken from the CRSP
US stock database.

Fig. 4. Stock price for Stock-B. The time dimension is the i-th working day
from Jul. 2, 1962 to Dec. 31, 1997 (5 working days a week), taken from the CRSP
US stock database.

Fig. 5. Stock price for Stock-C. The time dimension is the i-th working day
from Jun. 11, 1997 to Dec. 31, 1997 (5 working days a week), taken from the
CRSP US stock database.

Fig. 6. Stock price for Stock-D. The time dimension is the i-th working day
from Jul. 2, 1962 to Apr. 30, 1968 (5 working days a week), taken from the CRSP
US stock database.

In Figure 4, we observe some periodic patterns in the original time-series.
However, they are very sensitive to the trend. We find that after the
decomposition, the “trend” time-series contains “up” and “down” movements. If
the period that a user chooses matches the trend, it is possible to find more
periodic patterns. In practice, however, the period is not easy to determine.
People usually use a week, month, quarter or year as the period when searching
for periodic patterns, and with these choices it might be difficult to find
patterns in this case. Searching for periodic patterns in the “seasonal”
time-series does not completely remove the sensitivity to the period, but the
issue becomes less important. In addition, some existing statistical
approaches can be used to determine the period for time-series decomposition.

In Figure 5, we cannot identify any periodic patterns in either the original
time series or the “seasonal” time series obtained by decomposition. The main
reason is that there is a sudden change of level in the original time series.
This suggests partitioning the original time series into two parts and
studying each of them separately. This approach has also been proposed in
non-parametric regression in statistics.

Figure 6 can be viewed as a combination of the two examples in Figures 4 and
5. Based on the above observations, we find that Han’s partial periodic
algorithm cannot handle all real applications very well. However, our
approach, which uses decomposition and partitioning of time series, provides
users with more flexibility to discover useful periodic patterns.

5 Concluding Remarks 

In this paper, we present an approach for periodic pattern search based on
the time-series decomposition STL, which is capable of handling seasonality of
any length. Computational cost is not the main concern of this paper. Instead,
we focus on the flexibility of periodic pattern search, in particular when
trends are present. This is because many real applications exhibit typical
trends: for example, the number of airline passengers has increased for many
years, and web access is increasing rapidly. It is difficult to find periodic
patterns in such time-series data. With decomposition, it is possible to
search for periodic patterns in the seasonal component and the trend component
as well as in the original time-series. We conducted a preliminary study using
the US CRSP stock database. Our approach has the flexibility to find periodic
patterns in different contexts.



Abstract. Proximity and density information modeling of 2D point-data by
Delaunay Diagrams has delivered a powerful exploratory and argument-free
clustering algorithm for geographical data mining. The algorithm obtains
cluster boundaries using a Short-Long criterion and detects non-convex
clusters, high and low density clusters, clusters inside clusters and many
other robust results. Moreover, its computation is linear in the size of the
graph used. This paper demonstrates that the criterion remains effective for
exploratory analysis and spatial data mining where other proximity graphs are
used. It also establishes a hierarchy of the modeling power of several
proximity graphs and presents how the argument-free characteristic of the
original algorithm can be traded for argument tuning. This enables clustering
in more than two dimensions by using linear-size proximity graphs such as
k-nearest neighbors.



1 Introduction 

In geographical data mining, spatial clustering consists of partitioning a
set P = {p_1, p_2, ..., p_n} of geo-referenced point-data in a two-dimensional
study region R into homogeneous sub-sets according to spatial proximity.
Spatial proximity indicates relative closeness rather than absolute closeness.
In particular, one peculiarity of the spatial data typically manipulated in a
Geographical Information System is spatial heterogeneity, which indicates the
intrinsic uniqueness of each location in 2D space. Statistical measurements,
however, depend upon absolute location in R (as opposed to relative location).
Hence, the same statistical property has different interpretations over R. For
instance, Fig. 1 shows two points p_1 and p_6 equidistant from p_5. Clustering
based merely on distance would place p_1 and p_6 in the same cluster as p_5.
But the length of edge e_15 = (p_1, p_5) is relatively long with respect to
the length of edge e_56 = (p_5, p_6). Site p_5 is likely to belong to the same
cluster as p_6. So, although the lengths of the edges e_15 and e_56 are equal,
they have different relative interpretations (relative to other points in R).
In spatial settings (and in spatial clustering), relative proximity is more
important than absolute proximity.

Spatial clusters are spatial phenomena. The mixture of global and local
effects defines clusters and contributes to their characteristics such as
location, scatter, number of clusters, size and distribution. These properties
are unknown prior to clustering and prior to Knowledge Discovery. Thus, spatial
clustering must reveal them from the data rather than from user pre-specified
arguments or assumptions. This is the key difference between ESDA (Exploratory
Spatial Data Analysis) and model-driven confirmatory analysis. ESDA explores
arguments to be set according to the spatial heterogeneity in the data.

The problem of clustering becomes a matter of identifying neighbors (building
a proximity graph) and quantifying their relative closeness and remoteness
(establishing a criterion function). This is a two-step process: preprocessing
and grouping. Preprocessing structures the raw input data and produces a
clustering schema, which guides grouping in the next step. The structuring
includes (explicitly or implicitly) an underlying proximity graph, values for
arguments, and mathematical summaries. During the grouping process, points are
combined by a certain criterion function based on the schema generated in the
previous stage. Clustering criteria reflect an implicit or explicit inductive
bias about the information that is to be inferred, and they vary from method
to method. These inference criteria allow the cluster membership to be
estimated. All inference criteria other than the one used here seem to focus
on the global peaks. Some criteria use a global argument. A global value
ignores spatial heterogeneity. More seriously, most are neither data-driven
nor based on a mixture of both global and local effects. These peak criteria
find the largest peaks, but have difficulty finding relatively smaller peaks.
Geographically, these smaller peaks are sparse clusters that are also of
interest besides global high-density clusters.

Recently, Estivill-Castro and Lee proposed an effective and efficient
clustering criterion (referred to here as the Short-Long criterion) that
overcomes the shortcomings of peak-inference clustering and satisfies the
special needs of geo-information. The main idea of boundary-based clustering
is the detection of the sharp changes in point density that form cluster
boundaries. If such changes are significant from a global point of view, then
they are reported as cluster boundaries. Thresholds for the significance test
are learned from the data but vary over R; thus, localized spatial
concentrations are correctly reported without any prior knowledge. The
original algorithm uses the Short-Long criterion with the Delaunay Diagram as
the underlying proximity graph. Here we extend the criterion to work well with
other proximity graphs. In particular, the Delaunay Diagram is linear in size
for 2D but quadratic in size for 3D. However, other proximity graphs (like
k-nearest neighbor and k-cone spanner graphs) are linear in size for all
dimensions. Thus, the main goal of this paper is to compare and contrast the
clustering results obtained with different proximity graphs when the
Short-Long criterion is applied to them. Proximity graphs are ranked for their
value as exploration tools for very large spatial data sets, where the user
may trade the argument-free use of the Delaunay Diagram for an argument-driven
exploration of the data.

Fig. 1. An example of spatial heterogeneity.

Next, we introduce the working principle of the Short-Long criterion in
Section 2. Section 3 summarizes eight popular spatial proximity graphs. We
compare and contrast their results. Section 4 summarizes the results of
experimental evaluations. Finally, the last section draws conclusions.

2 Short-Long Criterion 

The proximity graph named Delaunay Diagram inspired the Short-Long criterion.
In proximity graphs, vertices represent data points and edges connect pairs of
points to model spatial proximity and adjacency. They explicitly encode a
discrete relation is_NEIGHBOR ⊆ P × P. By assigning lengths to the edges of
the relation, we encode not only proximity but also density.

Cluster boundaries occur where there is high discrepancy (great variability)
among the lengths of the edges incident on a vertex p in a proximity graph.
This is because the two sides of a cluster boundary have different densities
(sparse and dense). Thus, border points in proximity graphs may have two
different types of incident edges: short edges and long edges. The former
connect points within a cluster (intra-cluster links) and the latter straddle
between clusters (inter-cluster links) or between a cluster and
noise/outliers. Exceptionally long edges incident to a border point are
indicative of inter-cluster links. These relatively close and remote neighbors
contribute to the characterization of border points. Interestingly, border
points share this characteristic with points on bridges. Bridge points also
have short edges linking to other neighboring bridge points or border points
and long edges connecting to noise. The data shown in Fig. 2(a) seems to have
two clusters linked by two bridge points. Fig. 2(c) highlights a border point
(green rectangle) and a bridge point (yellow rhombus) with incident edges in
solid lines. Both have exceptionally long edges (thick solid lines) and
exceptionally short edges (thin solid lines). In order to differentiate these
bridge points from border points, the Short-Long criterion temporarily
considers exceptionally short edges as inter-cluster links. Later, it performs
connected component analysis to recuperate all short edges that are
intra-cluster links (typically incident to border points) and to permanently
remove bridges.

Fig. 2. An example of border and bridge points: (a) Data set (n = 26). (b)
Delaunay Diagram. (c) Neighbourhoods of a border point and a bridge point.

More formally, for a point p_i, let Local_Mean(p_i) be the mean length of the
edges incident to p_i and let Local_St_Dev(p_i) be the standard deviation of
the lengths of these incident edges. An edge e_j incident to point p_i could be
classified as an inter-cluster edge if

$$|e_j| > \mathit{Local\_Mean}(p_i) + l \times \mathit{Local\_St\_Dev}(p_i). \tag{1}$$

That is, edges are considered long and inconsistent if their lengths are far
from the local mean. However, Local_Mean(p_i) and Local_St_Dev(p_i) represent
only local properties. We need to incorporate the global trend into Equation
(1) to correctly identify globally localized spatial clusters. If the factor l
in Equation (1) scales the threshold globally, it applies absolute information
to all p_i rather than relative information. This is because Local_St_Dev(p_i)
is expressed in the same units as Local_Mean(p_i). That is, we would have a
uniform (absolute) acceptance interval for intra-cluster edges for all p_i.
However, a smaller value of Local_Mean(p_i) indicates that p_i is internal to
a cluster, since a smaller Local_Mean(p_i) implies that p_i is relatively close
to its neighbors. Thus, we require the value of l to be greater than 1 in
order to widen the acceptance interval and thus preserve a certain level of
heterogeneity around p_i for smaller Local_Mean(p_i). Conversely, we require
the value of l to be less than 1 to restrict the acceptance interval for
larger values of Local_Mean(p_i), so that more edges incident to border points
or noise/outliers are removed. Let Relative_St_Dev(p_i) denote the ratio of
Local_St_Dev(p_i) to Mean_St_Dev(P), where Mean_St_Dev(P) is the average of
the Local_St_Dev(p_i) over all p_i. Then Relative_St_Dev(p_i) is the ratio of
local deviation to global deviation. Relative_St_Dev(p_i) is less than 1 for
those p_i that locally exhibit less spread in the lengths of their incident
edges than the generality of points in P, and greater than 1 for those p_i
that locally exhibit more variability. The inverse of Relative_St_Dev(p_i)
fulfills the role required of the factor l:

$$l = \mathit{Relative\_St\_Dev}(p_i)^{-1} = \mathit{Mean\_St\_Dev}(P) / \mathit{Local\_St\_Dev}(p_i). \tag{2}$$

Finally, the acceptance interval AI(p_i) for each p_i is obtained by
substituting Equation (2) into Equation (1):

$$\mathit{Local\_Mean}(p_i) - \mathit{Mean\_St\_Dev}(P) < AI(p_i) < \mathit{Local\_Mean}(p_i) + \mathit{Mean\_St\_Dev}(P). \tag{3}$$

Note that the acceptance interval AI(p_i) utilizes both local and global trend
information, where Local_Mean(p_i) represents inverse local strength and
Mean_St_Dev(P) denotes the global degree of variation. Further, AI(p_i) is not
static, but rather dynamic over R.
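
The acceptance interval can be made concrete with a minimal sketch. It is
written under stated assumptions rather than taken from the original
algorithm: the proximity graph is given as an undirected edge list, the
population standard deviation is used (the text does not say which estimator
is meant), and the function name and the toy data are hypothetical. An edge is
reported as long at a vertex when its length lies above the interval of
Equation (3), and as short when it lies below.

```python
from math import dist
from statistics import mean, pstdev

def short_long_edges(points, edges):
    """Classify (vertex, edge) pairs against the acceptance interval AI(p_i)."""
    incident = {i: [] for i in range(len(points))}
    for a, b in edges:  # undirected proximity graph
        length = dist(points[a], points[b])
        incident[a].append(((a, b), length))
        incident[b].append(((a, b), length))
    # Local_Mean(p_i) and Local_St_Dev(p_i) for every vertex with edges.
    local_mean = {i: mean([l for _, l in inc]) for i, inc in incident.items() if inc}
    local_sd = {i: pstdev([l for _, l in inc]) for i, inc in incident.items() if inc}
    mean_sd = mean(list(local_sd.values()))  # Mean_St_Dev(P)
    short, long_ = set(), set()
    for i, inc in incident.items():
        for edge, length in inc:
            if length > local_mean[i] + mean_sd:
                long_.add((i, edge))
            elif length < local_mean[i] - mean_sd:
                short.add((i, edge))
    return short, long_, mean_sd

# Hypothetical data: two unit squares joined by a single long edge (1, 4).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (11, 0), (10, 1), (11, 1)]
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 7), (6, 7), (1, 4)]
short, long_, mean_sd = short_long_edges(points, edges)
print(sorted(long_), sorted(short), round(mean_sd, 4))
```

In this toy graph only the two endpoints of the joining edge see both
exceptionally long and exceptionally short incident edges, which is the
signature of border points described in the text.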

The Short-Long criterion is applied to a proximity graph for the grouping
process. Readers may refer to the original paper for details of the process.

3 Proximity Graphs Model Spatial Proximity

We now list 8 common families of proximity graphs. They derive from several
modeling considerations, including modeling proximity and topology. For
example, one possible way of capturing topological relations such as
ADJACENT_TO amongst point data is to perform point-to-area transformations. A
widely adopted point-to-area transformation is to assign every location in R
to the nearest p_i in P. This creates regions in R. The resulting tessellation
is the well-known Voronoi Diagram, denoted by VD(P). As a consequence, two
points are neighbors if and only if their Voronoi regions share a common
Voronoi boundary. The explicit representation of this is_NEIGHBOR relation is
another tessellation, the Delaunay Triangulation, denoted by DT(P). Many
researchers use DT(P) as a proximity graph for clustering. Recently,
Estivill-Castro and Lee used the Delaunay Diagram, denoted by DD(P), since
this removes ambiguous diagonals when more than three points are co-circular.

Other proximity graphs are based on similar local closeness criteria. The
Gabriel Graph of P, denoted by GG(P), has an edge e = (p_i, p_j) if and only if
all other points in P - {p_i, p_j} lie outside the circle having e as diameter.
The Relative Neighborhood Graph of P, denoted by RNG(P), is based on the notion
of “relatively close” neighbors. Two points p_i, p_j ∈ P define an edge if they
are at least as close to each other as they are to any other point [12].
Matula and Sokal believe that Gabriel Graphs provide sufficient but not
excessive interconnections.

The Minimum Spanning Tree of P, denoted by MST(P), has long been used for
single-linkage clustering (in fact, MST(P) is not unique; there may be several
Minimum Spanning Trees for the same data). MST(P) is a sub-graph of RNG(P),
which is itself a sub-graph of GG(P), and GG(P) is a sub-graph of DT(P), so
each graph in this family encodes less proximity information than the one
that contains it. The simplicity of MST(P) makes its local statistical
properties vulnerable to small changes. More seriously, MST(P) does not reflect
local optimization information like DT(P), GG(P) or RNG(P), but rather global
optimization. Thus, it ignores local variations to a larger degree.
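
The containment chain can be checked on a small example. The sketch below is
illustrative only: it uses brute-force O(n^3) constructions rather than the
efficient algorithms of Computational Geometry, it omits DT(P) because a
Delaunay construction would need more code than fits here, and the function
names and the pseudo-random test points are assumptions of this sketch.

```python
import random
from itertools import combinations
from math import dist

def gabriel_graph(pts):
    """Edge (i, j) iff no other point lies inside the circle with diameter p_i p_j."""
    edges = set()
    for i, j in combinations(range(len(pts)), 2):
        d2 = dist(pts[i], pts[j]) ** 2
        if all(dist(pts[i], pts[k]) ** 2 + dist(pts[j], pts[k]) ** 2 >= d2
               for k in range(len(pts)) if k not in (i, j)):
            edges.add((i, j))
    return edges

def relative_neighborhood_graph(pts):
    """Edge (i, j) iff no third point is closer to both p_i and p_j than they are to each other."""
    edges = set()
    for i, j in combinations(range(len(pts)), 2):
        d = dist(pts[i], pts[j])
        if all(max(dist(pts[i], pts[k]), dist(pts[j], pts[k])) >= d
               for k in range(len(pts)) if k not in (i, j)):
            edges.add((i, j))
    return edges

def minimum_spanning_tree(pts):
    """Kruskal's algorithm with a small union-find structure."""
    parent = list(range(len(pts)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    tree = set()
    for i, j in sorted(combinations(range(len(pts)), 2),
                       key=lambda e: dist(pts[e[0]], pts[e[1]])):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different components
            parent[ri] = rj
            tree.add((i, j))
    return tree

gen = random.Random(7)  # fixed seed so the example is repeatable
pts = [(gen.random(), gen.random()) for _ in range(30)]
mst = minimum_spanning_tree(pts)
rng_edges = relative_neighborhood_graph(pts)
gg = gabriel_graph(pts)
print(len(mst), len(rng_edges), len(gg))
print(mst <= rng_edges <= gg)
```

The edge counts grow along the chain, which is the sense in which each
sub-graph carries less proximity information than the graph containing it.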

The Greedy Triangulation of P, denoted by GT(P), is another type of spatial
tessellation implicitly encoding proximity. GT(P) is obtained by repeatedly
inserting the shortest edge that does not intersect any previously inserted
edge. A greedy edge e represents spatial proximity in the sense that e is the
strongest interaction that does not interfere with previously chosen stronger
interactions.
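
The greedy construction just described can be sketched directly. This is a
naive version that tests each candidate against all accepted edges, and it
assumes points in general position (no three collinear); the helper names and
the four-point example are hypothetical.

```python
from itertools import combinations
from math import dist

def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a)."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def _cross(p, q, r, s):
    """True if segments pq and rs cross at a point interior to both."""
    if len({p, q, r, s}) < 4:  # sharing an endpoint is not a crossing
        return False
    return (_orient(p, q, r) != _orient(p, q, s) and
            _orient(r, s, p) != _orient(r, s, q))

def greedy_triangulation(pts):
    """Insert candidate edges by increasing length, skipping any that cross."""
    chosen = []
    for i, j in sorted(combinations(range(len(pts)), 2),
                       key=lambda e: dist(pts[e[0]], pts[e[1]])):
        if not any(_cross(pts[i], pts[j], pts[a], pts[b]) for a, b in chosen):
            chosen.append((i, j))
    return chosen

# Hypothetical convex quadrilateral: the shorter diagonal (1, 3) is kept and
# the longer diagonal (0, 2), which crosses it, is rejected.
pts = [(0, 0), (4, 0), (5, 3), (1, 4)]
gt = greedy_triangulation(pts)
print(sorted(gt))
```

Because edges are accepted purely by length, a short edge always wins over a
longer one that it crosses, which is the interference rule stated above.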

Another popular model for spatial proximity uses a metric and the
corresponding distance concept (this is the implicit philosophy within methods
like DBSCAN). Namely, points are considered neighbors if and only if they lie
within a certain distance d of each other. The corresponding proximity graph
is denoted by d-DG(P). However, this has a number of critical problems, the
most crucial being that globally setting the value of the argument d ignores
spatial heterogeneity.

An alternative is to assign the same number of neighbors to each point,
namely its k-nearest neighbors, denoted by k-NN(P). This has been used in
spatial data mining in CHAMELEON and could be seen as a variant of d-distance
neighboring where d is not globally fixed, but varies over R. All points in P
are assigned a globally fixed number k of neighbors.

One variant is the family of k-cone spanner graphs of P, denoted by k-CSG(P).
This captures the nearest neighbors in k pre-specified directional cones.
Thus, each point has k neighbors, each being the nearest in one of the k
directions.

Table 1. Characteristics of proximity graphs.

The graphs k-NN(P) and k-CSG(P) are argument-dependent (they need a value for
k) and directed (some edges are not symmetric). Table 1 summarizes the
characteristics of proximity graphs. In two dimensions, they all have linear
size; however, as we move up to 3D this is not the case. The graphs k-NN(P)
and k-CSG(P) are attractive as we move to higher dimensions since they remain
linear in size (when k is so small with respect to n that it can be regarded
as a constant). Moreover, the field of Computational Geometry has developed
O(n log n) time algorithms to compute these proximity graphs. Thus, the
extension of the Short-Long criterion proposed here is scalable to large data
sets.
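
A short sketch may help to contrast the two argument-dependent models. It
builds k-NN(P) and d-DG(P) by brute force on four hypothetical collinear
points; the function names are assumptions of this sketch, and a practical
implementation would use a spatial index instead of sorting all distances.

```python
from math import dist

def knn_graph(pts, k):
    """Directed edges (i, j): p_j is one of the k nearest neighbours of p_i."""
    edges = set()
    for i, p in enumerate(pts):
        others = sorted((j for j in range(len(pts)) if j != i),
                        key=lambda j: dist(p, pts[j]))
        edges.update((i, j) for j in others[:k])
    return edges

def distance_graph(pts, d):
    """Undirected d-DG(P): p_i and p_j are neighbours iff they lie within distance d."""
    return {(i, j) for i in range(len(pts)) for j in range(i + 1, len(pts))
            if dist(pts[i], pts[j]) <= d}

# Hypothetical data: three nearby points and one remote point.
pts = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
knn = knn_graph(pts, k=1)
asymmetric = {(i, j) for (i, j) in knn if (j, i) not in knn}
print(sorted(knn))         # every point has exactly k outgoing edges
print(sorted(asymmetric))  # edges whose reverse is absent: the graph is directed
print(sorted(distance_graph(pts, 2.0)))  # the remote point stays isolated
```

The remote point still receives a neighbour in k-NN(P), whereas in d-DG(P) a
single global d either isolates it or, if enlarged, connects everything else
much more densely.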



4 Performance Evaluation 

When applied to Delaunay Diagrams, the Short-Long criterion robustly obtains
clusters of arbitrary shapes, clusters of different densities, clusters of
variable sizes, clusters in the presence of noise, sparse clusters adjacent to
high-density clusters, closely located high-density clusters, clusters linked
by multiple bridges and clusters in the presence of obstacles. However, it
would be rather limiting if one were forced to use Delaunay Diagrams. We
report on experiments that contrast the clustering results of combining the
Short-Long criterion discussed in Section 2 with the proximity graphs
discussed in Section 3. The cases represent benchmark 2D data where
mere-density clustering fails.

Close clusters of several levels of density. An initial example is sparse
clusters closely surrounding high-density clusters. In this type of dataset,
DD(P) produces the best result. Boundaries are well detected. Clusters
surrounding higher-density clusters are detected and kept connected, but
disconnected from the internal high-density clusters. The results of GT(P)
approximate those of DD(P), but points in less dense clusters may be left
isolated.

In this type of data, GG(P) successfully identifies high-density clusters, but
fails to reliably detect the sparser cluster in that recuperation is
incomplete (more points are left isolated). Using RNG(P) degrades the results
further. Now, not only are sparser clusters undetected, but the high-density
clusters are divided into meaningless sub-groups.

Not surprisingly, MST(P) reports poor results. The main reason is that MST(P)
has lost proximity information in its n - 1 interconnections. Note that DD(P)
has only a small constant factor more edges (at most 3n - 6) and encodes far
more complete spatial proximity. The simple proximity information makes MST(P)
extremely vulnerable to small variations, and it is not surprising that
MST(P) is known to be a fast but fragile clustering method. Intermediate
quality is obtained with GG(P) and RNG(P). These proximity graphs tend to
exclude relatively heterogeneous interconnections from DT(P), such as edges
between the sparse cluster and the high-density clusters. This reduction
eventually decreases Mean_St_Dev(P) and thus narrows AI(p_i) for each p_i
(refer to Equation (3)). As a consequence, relative heterogeneity around
large Local_Mean(p_i) (within the sparse cluster) is not preserved, which
causes the sparse cluster to be fragmented. In general, the sub-graphs of
DT(P) miss some proximity information, which complicates the detection of
sparser clusters.
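
The narrowing effect can be seen numerically in a toy setting. The sketch
below assumes the population standard deviation and a hypothetical pair of
unit squares; it compares Mean_St_Dev(P), the half-width of the acceptance
interval in Equation (3), for a graph that keeps one heterogeneous
inter-cluster edge and for a sub-graph that drops it.

```python
from math import dist
from statistics import mean, pstdev

def mean_st_dev(points, edges):
    """Mean_St_Dev(P): average over vertices of the spread of incident edge lengths."""
    lengths = {i: [] for i in range(len(points))}
    for a, b in edges:
        d = dist(points[a], points[b])
        lengths[a].append(d)
        lengths[b].append(d)
    return mean([pstdev(ls) for ls in lengths.values() if ls])

# Hypothetical data: two unit squares; edge (1, 4) is the only inter-cluster link.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (11, 0), (10, 1), (11, 1)]
intra = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 7), (6, 7)]
with_link = mean_st_dev(points, intra + [(1, 4)])
without_link = mean_st_dev(points, intra)
print(round(with_link, 4), round(without_link, 4))
```

Dropping the heterogeneous edge lowers the global indicator, so every
acceptance interval shrinks, which is consistent with the fragmentation of
sparse clusters reported for the sub-graphs of DT(P).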

The family of graphs k-NN(P) provides good results when the best value for k
is used. It successfully detects the global hot spots, but still has
difficulty identifying the surrounding sparser concentration. This is to be
expected, but it is still a valuable result in higher dimensions, where
Delaunay Diagrams are quadratic. The difficulty arises because the proximity
information in k-NN(P) is biased towards one direction for points on the
border of a cluster. That is, edges connect to the dense side of the border,
where the cluster is. Non-connectivity between the sparse cluster and the
dense clusters lowers Mean_St_Dev(P), which eventually narrows AI(p_i) for
each p_i. Thus, the Short-Long criterion considers local variations around the
sparse cluster as too heterogeneous.

After several argument tuning steps (locating the value of k), the best
results of k-CSG(P) are good (as with k-NN(P)). Interestingly, some points
belonging to the sparser cluster are connected to high-density clusters.

Argument tuning (locating a value for d) is necessary for the best result of
d-DG(P). This extracts dense clusters, but sparser clusters are not identified
when the intra-cluster distance within sparser clusters is similar to the
inter-cluster distance between high-density clusters and sparser clusters.
Thus, if the value of the argument d is large enough, the sparse cluster and
the high-density clusters have many interconnections. This lowers
Mean_St_Dev(P) significantly. Consequently, d-DG(P) reports all the points as
a single cluster. Conversely, if the value of d is just small enough to avoid
the total merge, the less dense surrounding points are not identified as
belonging to a cluster.

Narrow links of high-density clusters. This situation may be seen as clusters
linked by multiple narrow bridges. Unlike in the first example, the proximity
graphs are robust to multiple bridges, except for RNG(P) and MST(P). This is a
great improvement over typical mere-density clustering. The graph k-CSG(P)
does not work as well for this case, but still separates some bridges.

Narrow gap of high-density clusters. In many real-world settings, clusters of
different densities are closely located. Examples are two highly populated
cities lying on opposite sides of a river, or troops densely located around
national boundaries. Datasets emulating these scenarios were tested. The two
spatial tessellations, DD(P) and GT(P), hold enough interconnections, and the
clustering results pinpoint closely located dense clusters while still
identifying other relatively sparse clusters.

Again, the argument k has to be tuned by the user for k-NN(P) and k-CSG(P).
But the results are very satisfactory, and the parameter-specific neighboring
graphs k-NN(P) and k-CSG(P) find the clusters. These graphs produce good
results, but at the cost of user direction and exploration of the argument
values.

For the same reason explained in the first example, large values of d detect
the sparser clusters but merge closely located dense clusters into one. A
smaller value of d is also ineffective because the result leaves points within
sparser clusters isolated. The proper sub-graphs of DT(P) offer lower-quality
results the more distant they are from DT(P). We can see that the proper
sub-graphs of DT(P) may work for the resolution of multiple bridges, but fail
to detect clusters with different densities due to loss of proximity and
density information.

Up to this point it seems that GT(P) is as good as DD(P) for the Short-Long
criterion. These two graphs are approximations to the Minimum Weight
Triangulation. Thus, they may be similar in some cases, and when that happens,
the resulting clusterings are naturally very similar. However, our experiments
showed cases where GT(P) is radically different from DT(P). In such cases, the
same problems as in single-linkage clustering appear, greedily creating long
paths of artificial bridges out of short edges. Thus, GT(P) is not better than
DT(P).

The experiments conducted in this section indicate that the Short-Long
criterion works well with properly modeled proximity graphs that hold enough,
but not excessive, information. Our experiments also show that the criterion
may work well with the argument-specific graphs k-NN(P) and k-CSG(P) when the
argument value is carefully tuned. Since typical mere-density clustering fails
on these examples, the experiments demonstrate the robustness of the
Short-Long criterion and its applicability to other proximity graphs.

Sensitivity to Arguments k and d. We proceeded to analyze sensitivity with
respect to the values of the arguments k and d. A change in these values
results in rather different proximity graphs and thus different clustering
results. In the graph k-NN(P), for smaller values of k, high-density clusters
are broken up into less meaningful sub-groups. This is because smaller values
of k capture only a few very close neighbors for each point, so the edges
incident to each point are relatively homogeneous in length. This homogeneity
lowers the global indicator of heterogeneity and narrows the acceptance
interval for each point, which eventually causes homogeneous local variations
to be fragmented. In order to avoid this fragmentation, we may tune the
argument k to higher values. In this case, too many interconnections merge the
dense clusters and the sparse cluster into one large cluster. In addition,
higher values of k connect bridge points together. Exploring different values
of k allows users to investigate the cohesion of clusters relative to others,
as well as the emergence of bridges, which are geographical features in their
own right. Bridges are to lines what clusters are to areas.
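
The merging side of this trade-off can be explored with a small sketch. It
does not apply the Short-Long criterion at all: it only counts the connected
components of k-NN(P), with directed edges treated as undirected, as k grows.
The data layout (two dense grids and one sparse group) and all names are
hypothetical.

```python
from math import dist

def knn_edges(pts, k):
    """Directed k-nearest-neighbour edges (i, j)."""
    edges = set()
    for i, p in enumerate(pts):
        order = sorted((j for j in range(len(pts)) if j != i),
                       key=lambda j: dist(p, pts[j]))
        edges.update((i, j) for j in order[:k])
    return edges

def count_components(n, edges):
    """Connected components, treating every directed edge as undirected."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(a) for a in range(n)})

# Hypothetical data: two dense 3x3 grids and a sparse group of four points.
dense_a = [(x * 1.0, y * 1.0) for x in range(3) for y in range(3)]
dense_b = [(x + 4.0, y * 1.0) for x in range(3) for y in range(3)]
sparse = [(0.0, 8.0), (3.0, 8.5), (6.0, 8.0), (3.0, 11.0)]
pts = dense_a + dense_b + sparse
ks = [1, 2, 4, 8, len(pts) - 1]
counts = {k: count_components(len(pts), knn_edges(pts, k)) for k in ks}
print(counts)
```

The number of components can only stay the same or drop as k increases, so a
user stepping through k sees groups merge one after another; where and when
they merge is the cohesion information referred to above.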

Similar merges and splits happen with k-CSG(P) when we tune the value of k. As
we increase the value of k, more points merge. Large values of k detect sparse
clusters but merge the closely located high-density clusters. Again, this type
of analysis can assess the cohesion of a cluster relative to other clusters.
It can identify where a cluster is weaker and about to split. The analysis can
indicate points that are just hanging on to a cluster and are likely to exert
or receive less influence on or from fellow cluster members.

However, the results of exploring along d with d-DG(P) are not as encouraging.
When a value of d allows densely populated clusters to be detected, it fails
to detect the sparser clusters. If we increase the value of d until it is
large enough for points within sparser clusters to reach their neighbors, then
all points belong to the same cluster. The increase of d merges not only
bridges, but also closely located high-density clusters.

The sensitivity analysis illustrates the trade-offs of tuning arguments. The
several trial-and-error steps needed to find best-fit arguments require
exploration time, and user bias may be introduced. Argument-specific proximity
graphs demand user exploration of argument values for clustering massive data
sets. In exchange, they offer information on where the clusters are about to
break, and on bridges and other elements that connect clusters or are about to
merge them.

In fact, we have seen that d-DG(P) is one of the poorest performers (MST(P)
and RNG(P) are also poor). In some cases, no value of d is appropriate or
satisfactory. This is disappointing for algorithms like DBSCAN that are based
on this type of proximity representation. They require not only the use of
queries into data structures like R-Trees to obtain the proximity information,
but also the use of visualization tools like OPTICS to aid the user in tuning
the argument d. They may in fact miss many clusters.



5 Final Remarks 

Edges in proximity graphs represent interactions between neighboring points, 
thus they encode spatial proximity. Assigning a weight proportional to the length 
of the edges provides spatial density. In practice, the more edges a proximity 
graph has, the more proximity information the graph possesses. However, some 
of these edges may become irrelevant. Theoretically, the set P (of size n) encodes 
all the proximity information, while the explicit complete graph, although a 
proximity graph, is certainly redundant since it has quadratic size. We have 
shown that the Short-Long criterion is applicable to a large family of proximity 
graphs to obtain satisfactory clustering. That is, the Short-Long criterion does not 
require maximal information. It just requires enough interconnections to detect 
spatially aggregated concentrations. This is very important for Data Mining 
applications, since the intent is to apply the Short-Long criterion to proximity 
graphs that take sub-quadratic time to compute and thus use sub-quadratic 
space. Such is the case of Delaunay Diagrams and their sub-graphs presented here, 
which require O(n log n) time and space in two dimensions. From three 
dimensions onward this is no longer the case, whereas graphs like k-NN and k-CSG 
are sub-quadratic in time and space for any dimension. We have shown that the 
Short-Long criterion is effective for these proximity graphs, at the expense of 
more user participation in setting argument values. This exploration can detect 
the cohesion of clusters relative to other clusters, potential weaknesses in a cluster 
or its likely split, as well as items about to depart from the cluster. 

Abstract. Currently there is no model available that facilitates the task of 
finding similar time series based on the partial information that interests users. In 
this paper we study a novel class of query problems that we term micro similarity 
queries (MSQ). We present the formal definition of MSQ and investigate a method 
for processing MSQ efficiently. We evaluate the behavior of the MSQ 
problem and of our query algorithm with both synthetic and real data. The results 
show that the knowledge revealed by MSQ corresponds with the subjective feeling 
of similarity based on singular interest. 

Keywords: Micro similarity query, micro nearest neighbor query, micro range 
query, time series analysis, data mining algorithm. 



1. Introduction 

Time series constitute a large part of the data stored in many information systems. 
Algorithms that find similar series based only on the partial information of 
interest are crucial to many data mining applications. 

There have been several efforts to develop similarity query models for time series 
data. [4] uses real-valued functions for representing and querying. In [5], 
non-overlapping ordered similar subsequences are used. The Fourier transform [1] and 
singular value decomposition [6] are used to provide better performance. 

These similarity query models mainly focus on the overall and most 
remarkable series behaviors. All information, or partial information chosen simply 
by ignoring the parts of minor energy, is used during the query process. An example is 
“find the stocks with similar price movement”. In this kind of query model, the 
local or minor series behaviors are seen as useless information and are sometimes 
ignored for the sake of efficiency. However, in many applications these 
neglected behaviors are not something we can ignore but exactly what interests us. For 
example, in data mining problems in the stock market, we care not only about the 
overall price movements, but also about local or minor ones, or combinations of them. A 
possible problem is “identify stocks whose prices usually fluctuate at the same time and 
with the same amplitude”. 

We believe that queries based on these series behaviors can be widely used in 
many data mining applications. To the best of our knowledge, this problem has not been 
well studied, nor has it been formally defined. 

Cluster identifiers of sub-series [3] and subsequence matching [2] can also be used 
for similarity queries by local features. However, the interesting behaviors are always 
influenced, or even flooded, by the overall and most remarkable series movement. The 
patterns these methods reveal may contain useless information. This makes it difficult 
or even impossible to discover really useful knowledge. 

For this reason, we propose a novel class of query problems that we term micro 
similarity queries (MSQ), together with a method for processing them. We evaluated 
the behavior of the MSQ problem and of our query algorithm with both synthetic and real 
data. The query results closely correspond with the subjective feeling of similarity 
between the time series based on singular interest. 

This paper is organized as follows: Section 2 formally defines the MSQ problem. 
Section 3 describes our solution to this problem. Section 4 presents our experimental 
results. Finally, Section 5 offers some concluding remarks. 



2. Formal Definition of the Problem 

A time series $O(n)$ is a sequence of real numbers, each number representing a value at 
a time point. In this paper, the time series to be compared have the same length. 
The formal definitions are as follows: 

Definition 2.1: Choose a rule F to decompose $O(n)$ into components $\{O^k(n)\}$ such that 

$$O(n) = \sum_k O^k(n).$$ 

$O^k(n)$ is said to be the k-th F-based decomposition of $O(n)$. 

Definition 2.2: Let $O_m(n)$ be the m-th sub series of $O(n)$ and $\{O_m^k(n)\}$ be the 
F-based decomposition of $O_m(n)$. Choose a one-to-one series-value mapping rule T 
and map each $O_m^k(n)$ to a value, $T: O_m^k(n) \mapsto S_k(m)$. All $S_k(m)$, taken in 
the original order, form a sequence. This sequence is said to be the k-th micro 
representing sequence. 

Definition 2.3: Let the sequence $S_k(m)$ be the k-th micro representing sequence. 
Let $K=(K_1,K_2,\ldots,K_W)$, where $K_n$ is the order of the n-th interesting micro 
representing sequence. The degree of the user's interest in the k-th micro representing 
sequence is $A_k$. The sequence 

S(m) = 

is said to be the full micro representing sequence of time series $O(n)$, denoted FMR(O). 
Here MOD(m,K) represents the remainder of m/K. $K=(K_1,\ldots,K_W)$ and $A=(A_1,\ldots,A_W)$ 
represent the user's interest. 

Definition 2.4: Let X, Y be two time series. The micro distance MD(X,Y) is 
MD(X,Y) = D(FMR(X), FMR(Y)). 

Here D(X,Y) is any distance measure currently in use. For example, we 
can use the Euclidean distance: $D(X,Y) = L_2(X,Y)$. 




360 X.-m. Jin, Y. Lu, and C. Shi 



Definition 2.5: Given a time series set SS and a query time series Q, the micro nearest 
neighbor query (MNNQ) problem is 

$$MNN(Q) = \{\, X \in SS \mid \forall Y \in SS:\ MD(X,Q) \le MD(Y,Q) \,\}$$ 

Definition 2.6: Given a threshold $\varepsilon$, a time series set SS and a query time series Q, the 
micro range query (MRQ) problem is 

$$MR(Q) = \{\, X \in SS \mid MD(X,Q) \le \varepsilon \,\}$$ 



3. Method for MSQ Problem 



3.1 Decomposition and Representation 



We use a discrete transform to decompose and represent the time series, and use the 
Euclidean distance to measure the distance between FMR sequences. Here we would like a 
transform (1) that is easy to compute and (2) that can represent interesting patterns 
easily and efficiently. We have chosen the Discrete Cosine Transform (DCT) in our 
experiments because it is widely used in many areas and meets our requirements well. 

Let $O(n)$ be a time series of length N. We first divide the input time series by sliding 
a window of width r. Then the j-th window of O is the contiguous sub series 
$O_j = (O(jr-r+1), \ldots, O(jr))$, $1 \le j \le q$, where $q = N/r$. 

Let H be the transform matrix. Then the r-point discrete transform of sub series $O_j$ 
is defined to be the series $T_j(k)$, given by $T_j = H\,O_j$. 

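Based on the results of the discrete transform, the decomposition of $O(n)$ in our 
method assigns to the k-th component of the m-th sub series the part carried by 
the k-th transform coefficient $T_m(k)$. The k-th micro representing sequence 
is $S_k(m) = T_m(k)$, and FMR(O) is built from the selected coefficients $T_m(K_n)$, 
each weighted by $A_n$. 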

We can calculate the Euclidean distance between FMR(X) = XS and FMR(Y) = YS 
to determine MD(X,Y). This makes it easy to apply existing index structures and 
query methods to the MSQ problem. Let $DT(X_j) = XT_j$ and $DT(Y_j) = YT_j$. It is 
easy to prove that 

$$MD(X,Y) = L_2(XS,YS) = \left[\sum_{j=1}^{q}\sum_{n=1}^{W}\bigl(XT_j(K_n) - YT_j(K_n)\bigr)^2 A_n^2\right]^{1/2}$$ 



Our method can be made independent of the baseline if $XT_j(0)$ and $YT_j(0)$ are not 
used in calculating MD. The following MD is independent of scaling factors: 

$$MD(X,Y) = \sqrt{\sum_{j=1}^{q}\sum_{n=1}^{W}\left(\frac{XT_j(K_n)}{XT_j(0)} - \frac{YT_j(K_n)}{YT_j(0)}\right)^2 A_n^2}$$ 
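
The decomposition and distance described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes an orthonormal DCT-II as the transform matrix H, non-overlapping windows of width r, 0-based coefficient indices in K, and FMR entries of the form A_n * T_j(K_n). The function names `dct_matrix`, `fmr` and `micro_distance` are hypothetical.

```python
import numpy as np

def dct_matrix(r):
    """Orthonormal DCT-II matrix H (r x r); row k holds the k-th cosine basis vector."""
    n = np.arange(r)
    k = n.reshape(-1, 1)
    H = np.sqrt(2.0 / r) * np.cos(np.pi * (2 * n + 1) * k / (2 * r))
    H[0, :] /= np.sqrt(2.0)  # rescale the constant row so that H is orthonormal
    return H

def fmr(series, r, K, A):
    """Sketch of FMR: the weighted coefficients A_n * T_j(K_n), window by window.

    Assumption: K holds 0-based coefficient indices, A the matching weights.
    """
    series = np.asarray(series, dtype=float)
    q = len(series) // r                      # number of complete windows
    windows = series[: q * r].reshape(q, r)   # row j is the sub series O_j
    T = windows @ dct_matrix(r).T             # row j holds T_j = H O_j
    return (T[:, list(K)] * np.asarray(A, dtype=float)).ravel()

def micro_distance(x, y, r, K, A):
    """MD(X, Y): Euclidean distance between FMR(X) and FMR(Y)."""
    return float(np.linalg.norm(fmr(x, r, K, A) - fmr(y, r, K, A)))

# Usage: a series and a copy with a shifted baseline.
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=40))
y = x + 5.0
print(micro_distance(x, y, r=8, K=(2,), A=(1.0,)))
```

Because coefficient 0 is not among the selected indices, the constant shift between `x` and `y` does not contribute to the distance, which illustrates the baseline independence mentioned above.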



3.2 Indexing and Query Processing 

Our overall querying strategy is as follows. 

Step 1 - Interest specification: According to their applications, users express 
their interest by K and A. 

Step 2 - Feature retrieval: We obtain the transform coefficients needed by 
applying the transform equation to each sub series. Then, we use the selected transform 
coefficients to construct the FMR of each time series. 

Step 3 - Index construction: All FMRs are inserted into a multidimensional index to 
support efficient querying. The MNNQ and MRQ problems can be reduced to 
nearest neighbor queries and range queries. Therefore, our approach can be applied with 
any index structure designed for nearest neighbor or range queries, such 
as the R*-tree [7] and the SS-tree [8]. 

Step 4 - Querying: After the index has been built, we can carry out MRQ or MNNQ. Any 
algorithm suited to range queries and nearest neighbor queries can be used here. 
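
As a rough illustration of Steps 3 and 4, the sketch below answers both query types with a linear scan over precomputed FMR vectors. The linear scan is a stand-in chosen for brevity; it is not the R*-tree or SS-tree mentioned above, and the function names and the toy vectors are hypothetical.

```python
import numpy as np

def micro_range_query(fmr_db, fmr_q, eps):
    """MRQ: ids of all series whose FMR lies within distance eps of the query's FMR."""
    dists = np.linalg.norm(fmr_db - fmr_q, axis=1)
    return [int(i) for i in np.flatnonzero(dists <= eps)]

def micro_nn_query(fmr_db, fmr_q, k=1):
    """MNNQ: ids of the k series with the smallest micro distance to the query."""
    dists = np.linalg.norm(fmr_db - fmr_q, axis=1)
    return [int(i) for i in np.argsort(dists, kind="stable")[:k]]

# Toy data: one FMR vector per row (assumed to be computed beforehand).
fmr_db = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
fmr_q = np.array([0.9, 0.0])

print(micro_nn_query(fmr_db, fmr_q))          # nearest series id
print(micro_range_query(fmr_db, fmr_q, 1.0))  # ids within distance 1.0
```

A multidimensional index returns the same answers while avoiding the scan over every stored vector.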



4. Experimental Results 

In our experiments, DCT was used with K=(2) and A=(1). This means we want to find time 
series that always show up/down behaviors similar to those of the query time series in 
each window. 

For the experiments with synthetic data, we use a generated random walk time series 
Q(n) with 180 data points as the query time series. We then query it against a data set 
consisting of 5 time series that were generated using the following function: 

$$X_i(n) = Q(n) + A_i \sin(2\pi n / T_i) + e$$ 

Here e is Gaussian noise. An example data set is depicted in Fig. 1, and the coefficients 
used are shown in Table 1 together with the distance D and the micro distance MD. In 
this experiment, the length of the sub series was set to 8. 

From the visual analysis, as the frequency $1/T_i$ increases, the micro distance 
increases while the global distance decreases; i.e., for the time series to which a 
lower-frequency sinusoid was applied, MD becomes smaller while the distance D 
becomes larger. Our algorithm produces the same results. 




Table 1. Experimental Results for Synthetic Data 

Fig. 1. Synthetic Data Set 

Table 2. Experimental Results for Real Data 

Fig. 2. Query Results for Real Data 



The real data used in our experiments was extracted from different equities of the 
Shenzhen stock market from 10/1/1997 to 8/1/2000. The data was collected daily 
over this period. In total, 100 time series were used. All FMRs were extracted with 
a sliding window of size 5, which corresponds to the number of trading days in one 
week, and were inserted into an R*-Tree. 

One of our results is shown in Fig. 2. S1 is the query time series; the others are the 
9 nearest series sorted by their distance from S1. The micro distances are listed in Table 2. 
The results reveal a new kind of association between the companies, which cannot be 
discovered by other query models that focus on overall time series behavior. 



5. Conclusion 

In this paper, we studied a novel query problem for finding associations in time series 
databases that cannot be discovered by other query models, which focus on the overall and 
most remarkable time series behavior. We believe this kind of query can be 
widely used in many data mining applications and will discover novel and interesting 
knowledge. 

Application knowledge should be used when defining the interesting patterns, 
choosing the coefficients, and analyzing the results. Whether the 
algorithm finds the time series we want depends mainly on how well application 
knowledge is involved. 

Abstract. We define an optimal class association rule set to be the 
minimum rule set with the same prediction power of the complete class 
association rule set. Using this rule set instead of the complete class 
association rule set we can avoid redundant computation that would 
otherwise be required for mining predictive association rules and hence 
improve the efficiency of the mining process significantly. We present an 
efficient algorithm for mining the optimal class association rule set using 
an upward closure property of pruning weak rules before they are actually 
generated. We have implemented the algorithm and our experimental 
results show that our algorithm generates the optimal class association 
rule set, whose size is smaller than ^ of the complete class association 
rule set on average, in significantly less time than generating the complete 
class association rule set. Our proposed criterion has been shown to be very 
effective for pruning weak rules in dense databases. 



1 Introduction 

1.1 Mining Predictive Association Rules 

The goal of association rule mining is to find all rules satisfying some basic 
requirements, such as the minimum support and the minimum confidence. It 
was initially proposed to solve the market basket problem in transaction databases, 
and has since been extended to solve many other problems, such as the classification 
problem. A set of association rules for the purpose of classification is called a 
predictive association rule set. Usually, predictive association rules are based on 
attribute-value (relational) databases, where the consequences of rules are 
pre-specified categories. Clearly, an attribute-value database can be mapped to a 
transaction database when each attribute and attribute-value pair is considered 
as an item. After an attribute-value database has been mapped into a transaction 
database, a class association rule set is a subset of association rules with the 
specified targets (classes) as their consequences. Generally, mining predictive 
association rules involves the following two steps. 

1. Find all class association rules from a database, and then 

2. Prune and organize the found class association rules and return a sequence 
of predictive association rules. 
In this paper, we focus on the first step. There are two problems in finding 
all class association rules. 

– It may be hard to find the complete class association rule set in dense databases due 
to the huge number of class association rules. For example, many databases 
support more than 80,000 class association rules. 

– Too many class association rules will reduce the overall efficiency of mining 
the predictive association rule set. This is because the set of found class 
association rules is the input of the second step, whose efficiency is 
mainly determined by the number of input rules. 

To avoid the above problems, it is necessary to find a small subset 
with the same prediction accuracy as the complete class association rule 
set, so that this subset can replace the complete class association rule set. Our 
proposed optimal class association rule set is the smallest subset with the same 
prediction power as the complete class association rule set; prediction power 
will be formally defined in Section 2. We present an efficient algorithm to generate 
the optimal class association rule set. It takes advantage of an upward closure 
property to prune, before they are actually generated, those complex rules that have 
lower accuracy than their simpler forms in dense databases. Our 
algorithm avoids the redundant computation of mining the complete class 
association rule set from dense databases and improves the efficiency of the mining 
process significantly. 

1.2 Related Work 

Mining association rules is a central task of data mining and has found 
applications in various areas. Currently, most algorithms for mining 
association rules are based on Apriori and use the so-called “downward closure” 
property, which states that all subsets of a frequent itemset must be frequent. 
Many examples of these algorithms can be found in the literature. A symmetric expression 
of the downward closure property is the upward closure property: all supersets of an 
infrequent itemset must be infrequent. We will use this property throughout the 
paper. 

Finding classification rules has been an important research focus in the 
machine learning community. Mining classification rules can be viewed as a 
special form of mining association rules, since a set of association rules with 
pre-specified targets can be used for classification. Techniques for mining association 
rules have already been applied to mining classification rules. In particular, 
the reported results are very encouraging, since the resulting classifiers can be 
more accurate than those built by C4.5. However, such an approach is not very efficient, 
since it uses an Apriori-like algorithm to generate the class association rules, whose 
number may be very large when the minimum support is small. In this paper we will show 
that we can use a much smaller class association rule set to replace this set without 
losing accuracy (prediction power). 

Generally speaking, a class association rule set is a type of target-constraint 
association rule set. Constraint rule sets and optimal rule sets belong to this 
type. The problem with these rule sets is that they either exclude some useful 
predictive association rules, or contain many redundant rules that are of no use 
for prediction. Moreover, algorithms for mining these rule sets handle only one 
target at a time (building one enumeration tree), so they cannot be used efficiently 
for mining class association rules on multiple targets, especially 
when the number of targets is large. Our optimal class association rule set differs 
from these rule sets in that it is minimal in size and keeps high prediction 
accuracy. We propose an algorithm that finds this rule set with respect to all 
targets at once. 

In this paper we only address the first step of mining predictive association 
rules. For related work on pruning and organizing the found class association rules, 
we refer the reader to the literature. 



1.3 Contributions 

Contributions in this paper are the following. 

1. We propose the concept of the optimal class association rule set for predictive 
association rule mining. It is the minimum subset of the complete class 
association rule set with the same prediction power as the complete class rule set, 
and can be used as a substitute for the complete class association rule set. 

2. We present an efficient algorithm for mining the optimal class association 
rule set. This algorithm differs from Apriori in that 1) it uses an 
additional upward closure property for forward pruning of weak rules (pruning 
before they are generated), and 2) it integrates frequent set mining and rule 
finding. 

Unlike the existing constraint and optimal rule mining algorithms, our algo- 
rithm finds strong (optimal) rules with all possible targets at one time. 

2 Optimal Class Association Rule Set 

Consider an attribute-value database D with n attribute domains. A record of D is 
an n-tuple. For convenience of description, we consider a record as a set of 
attribute and value pairs, denoted by T. A pattern is a subset of a record. We say 
a pattern is a k-pattern if it contains k attribute and value pairs. An implication 
in database D is A => c, where A is a pattern, called the antecedent, and c is an 
attribute value, called the consequence. Strictly speaking, the consequence is an 
attribute and value pair, but in class association rule mining the target attribute is 
usually specified, so we can use its value directly without confusion. The support of 
pattern A is defined to be the ratio of the number of records containing A to 
the number of all records in D, denoted by sup(A). The support of implication 
A => c is defined to be the ratio of the number of records containing both A 
and c to the number of all records in D, denoted by sup(A => c). The confidence 
of the implication A => c is defined to be the ratio of sup(A => c) to sup(A), 
denoted by conf(A => c). 
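
The definitions of support and confidence can be made concrete with a small sketch. The toy database and the function names below are illustrative assumptions; records are modelled as sets of (attribute, value) pairs, with the class label stored as one more pair.

```python
def support(db, pattern):
    """Fraction of records that contain every (attribute, value) pair in `pattern`."""
    pattern = frozenset(pattern)
    return sum(pattern <= rec for rec in db) / len(db)

def confidence(db, antecedent, target):
    """conf(A => c) = sup(A => c) / sup(A); assumes sup(A) > 0."""
    a = frozenset(antecedent)
    return support(db, a | {target}) / support(db, a)

# Hypothetical toy database with target attribute "play".
db = [
    frozenset({("outlook", "sunny"), ("windy", "no"), ("play", "yes")}),
    frozenset({("outlook", "sunny"), ("windy", "yes"), ("play", "no")}),
    frozenset({("outlook", "rain"), ("windy", "no"), ("play", "yes")}),
    frozenset({("outlook", "sunny"), ("windy", "no"), ("play", "yes")}),
]

sunny = {("outlook", "sunny")}
print(support(db, sunny))                      # sup(A)
print(confidence(db, sunny, ("play", "yes")))  # conf(A => c)
```

Here three of the four records contain the antecedent and two of those three also carry the target, so the support is 3/4 and the confidence 2/3.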
A class association rule is defined to be an implication with a pre-specified 
target (a value of the target attribute) as its consequence and with support and 
confidence above given thresholds in a database. Given a target 
attribute, minimum support $\sigma$ and minimum confidence $\psi$, a complete class 
association rule set is the set of all class association rules, denoted by $R_c(\sigma, \psi)$. 

Our goal in this section is to find the minimum subset of the complete class 
association rule set that has the same prediction power as the complete class 
association rule set. 

To begin with, let us look at how a rule makes a prediction. Given a rule 
r, we use cond(r) to represent its antecedent (conditions), and cons(r) to denote 
its consequence. Given a record T in a database D, we say rule r can make a 
prediction on T if $cond(r) \subseteq T$, denoted by $r(T) \to cons(r)$. If cons(r) is the 
category (target attribute value) of record T, then this is a correct prediction. 
Otherwise, it is a wrong prediction. 

Then we consider the accuracy of a prediction. We begin by defining the 
accuracy of a rule. Confidence is not the accuracy of a rule, or more precisely, not 
the prediction accuracy of a rule, but the sample accuracy, since it is obtained 
from the sampling (training) data. Suppose that all instances in a database are 
independent of one another. Statistical theory supports the following assertion: 

$$acc_t(r) = acc_s(r) \pm z_N \sqrt{\frac{acc_s(r)\,\bigl(1 - acc_s(r)\bigr)}{n}}$$ 

where $acc_t$ is the true (prediction) accuracy, $acc_s$ is the accuracy over the sampling 
data, n is the number of sample data ($n \ge 30$), and $z_N$ is a constant relating to the 
confidence interval. For example, $z_N = 1.96$ if the confidence interval is 95%. We use 
the pessimistic estimate as the prediction accuracy of a rule. That is, 

$$acc(r) = conf(r) - z_N \sqrt{\frac{conf(r)\,\bigl(1 - conf(r)\bigr)}{|cov(r)|}}$$ 

where cov(r) is the covered set of rule r, which is defined in the next section. If 
$n < 30$, then we use the Laplace accuracy instead, that is, 

$$acc(r) = \frac{sup(r) \cdot |D| + 1}{|cov(r)| + p}$$ 

where p is the number of target attribute values (classes). 
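
The two accuracy estimates can be sketched as a single function. This is an illustrative reading of the text, not the authors' code: it assumes the sample size n is the size of the covered set, and that the Laplace numerator is the number of covered records that carry the target, plus one. The name `rule_accuracy` and its parameters are hypothetical.

```python
import math

def rule_accuracy(conf, n_covered, n_correct, n_classes, z=1.96):
    """Estimated prediction accuracy of a rule.

    conf       sample accuracy (confidence) of the rule
    n_covered  |cov(r)|, number of records matching the antecedent
    n_correct  number of covered records that also carry the target
    n_classes  p, number of target attribute values
    z          constant for the confidence interval (1.96 for 95%)
    """
    if n_covered >= 30:
        # Pessimistic estimate: lower end of the confidence interval.
        return conf - z * math.sqrt(conf * (1.0 - conf) / n_covered)
    # Small samples: Laplace accuracy (assumed numerator, see lead-in).
    return (n_correct + 1) / (n_covered + n_classes)

print(rule_accuracy(0.9, 100, 90, 2))  # large sample, pessimistic estimate
print(rule_accuracy(1.0, 4, 4, 2))     # small sample, Laplace accuracy
```

For the first call the estimate is 0.9 - 1.96 * 0.03, and for the second it is 5/6, so a rule that is perfect on only four records is rated below a 90% rule supported by a hundred records.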

After we have obtained the prediction accuracy of a rule, we can estimate 
the accuracy of a prediction as follows: the accuracy of a prediction equals the 
prediction accuracy of the rule making the prediction, denoted by $acc(r(T) \to c)$. 

In the following part, we will discuss a prediction made by a rule set, and 
how to compare the prediction power of two rule sets. 

Given a rule set R and an input T, there may be more than one rule in R 
that can make a prediction, such as $r_1(T) \to c_1, r_2(T) \to c_2, \ldots$ We say that the 
prediction made by R is the same as the prediction made by r if r is the rule 
with the highest prediction accuracy among all $r_i$ where $cond(r_i) \subseteq T$. The accuracy 
of such a prediction equals the accuracy of rule r. If more than 
one rule has the same highest prediction accuracy, we choose the one with the 
highest support among them. When the predicting rules have the same accuracy 
and support, we choose the one with the shortest antecedent. If there is no 
prediction made by R, then we say the rule set gives an arbitrary prediction with 
accuracy zero. 
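
The selection procedure just described (highest accuracy, then highest support, then shortest antecedent) can be sketched as follows. The `Rule` type and the sample rules are hypothetical; accuracies and supports are assumed to have been computed beforehand.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    cond: frozenset   # antecedent: set of (attribute, value) items
    cons: str         # consequence: target value
    acc: float        # estimated prediction accuracy
    sup: float        # support

def predict(rules, record):
    """Return (prediction, accuracy) made by a rule set on one record."""
    matching = [r for r in rules if r.cond <= record]
    if not matching:
        return None, 0.0  # arbitrary prediction with accuracy zero
    # Highest accuracy first, then highest support, then shortest antecedent.
    best = max(matching, key=lambda r: (r.acc, r.sup, -len(r.cond)))
    return best.cons, best.acc

rules = [
    Rule(frozenset({"a"}), "yes", 0.8, 0.5),
    Rule(frozenset({"a", "b"}), "no", 0.8, 0.5),
    Rule(frozenset({"c"}), "no", 0.9, 0.1),
]

print(predict(rules, frozenset({"a", "b"})))
print(predict(rules, frozenset({"a", "c"})))
print(predict(rules, frozenset({"d"})))
```

In the first call the two matching rules tie on accuracy and support, so the shorter antecedent decides; in the last call no rule applies.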

To compare the prediction power of two rule sets, we define 
Definition 1. Prediction power 

Given rule sets $R_1$ and $R_2$ from database D, we say that $R_2$ has at least the same 
power as $R_1$ iff, for all possible inputs, both $R_1$ and $R_2$ give the same prediction 
and the prediction accuracy of $R_2$ is at least the same as that of $R_1$. 

It is clear that not all rule sets are comparable in their prediction power. 
Suppose that rule set $R_2$ has at least the same power as rule set $R_1$. Then for every 
input T, if there is a rule $r_1 \in R_1$ giving prediction c with accuracy $k_1$, then there 
must be another rule $r_2 \in R_2$ such that $r_2(T) \to c$ with accuracy $k_2 \ge k_1$. 

We denote the fact that rule set $R_2$ has at least the same power as rule set $R_1$ by 
$R_2 \ge R_1$. It is clear that $R_2$ has the same power as $R_1$ iff $R_2 \ge R_1$ and $R_1 \ge R_2$. 
Now, we can define our optimal class association rule set. 

Given two rules $r_1$ and $r_2$, we say that $r_2$ is stronger than $r_1$ iff $r_2 \subset r_1 \wedge 
acc(r_2) \ge acc(r_1)$, denoted by $r_2 > r_1$. Specifically, we mean $cond(r_2) \subset cond(r_1)$ 
and $cons(r_2) = cons(r_1)$ when we say $r_2 \subset r_1$. Given a rule set R, we say a rule 
in R is strong if there is no other rule in R that is stronger than it. Otherwise, 
the rule is weak. Thus, we have the definition of the optimal class association rule 
set. 

Definition 2. Optimal class association rule set 

Rule set $R_o$ is optimal for class association over database D iff (1) $\forall r \in R_o$, 
$\nexists r' \in R_o$ such that $r < r'$, and (2) $\forall r' \in R_c - R_o$, $\exists r \in R_o$ such that $r > r'$. 

It is not hard to prove that the optimal class association rule set is unique for a 
given minimum support and minimum confidence on a database. Let $R_o(\sigma, \psi)$ 
stand for the optimal class association rule set on database D at given minimum 
support $\sigma$ and minimum confidence $\psi$. Then $R_o(\sigma, \psi)$ contains all strong rules 
from the complete class association rule set $R_c(\sigma, \psi)$. 
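
To make the definition concrete, the sketch below extracts the strong rules from an already generated complete rule set by pairwise comparison. This is the naive route, shown only to illustrate the "stronger than" relation; it is not the efficient algorithm the paper proposes. The `Rule` type and the sample rules are hypothetical.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    cond: frozenset   # antecedent
    cons: str         # consequence (target value)
    acc: float        # estimated prediction accuracy

def is_stronger(r2, r1):
    """r2 > r1: same consequence, cond(r2) a proper subset of cond(r1), acc(r2) >= acc(r1)."""
    return r2.cons == r1.cons and r2.cond < r1.cond and r2.acc >= r1.acc

def optimal_rule_set(complete):
    """Keep exactly the strong rules, i.e. those that no other rule is stronger than."""
    return [r for r in complete if not any(is_stronger(o, r) for o in complete)]

complete = [
    Rule(frozenset({"a"}), "yes", 0.8),
    Rule(frozenset({"a", "b"}), "yes", 0.7),   # weak: {"a"} => yes is stronger
    Rule(frozenset({"a", "c"}), "yes", 0.9),   # strong: more accurate than {"a"} => yes
    Rule(frozenset({"a", "b"}), "no", 0.75),   # strong: no general rule with target "no"
]

for rule in optimal_rule_set(complete):
    print(sorted(rule.cond), "=>", rule.cons, rule.acc)
```

The quadratic comparison is affordable only for small rule sets, which is why avoiding the generation of weak rules in the first place matters.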

Finally, we consider the prediction power of the optimal class association rule 
set we are concerned with. 

Theorem 1. The optimal class association rule set is the minimum subset of 
rules with the same prediction power as the complete class association rule set. 

Proof. For simplicity, let $R_c$ stand for $R_c(\sigma, \psi)$ and $R_o$ for $R_o(\sigma, \psi)$. 

First, from the previous definitions we have that $R_c \ge R_o$ and $R_o \ge R_c$, 
so the optimal class association rule set has the same prediction power as the 
complete class association rule set. 

Secondly, we prove the minimality of the optimal class association rule 
set. Suppose that we leave out a rule r from the optimal class association rule set 
$R_o$, giving $R'_o = R_o - \{r\}$, and that $R'_o$ has the same prediction power as $R_c$. From 
the definition, we know that no rule is stronger than rule r, so $R_o \ge R'_o$ 
but $R'_o \not\ge R_o$. As a result, $R'_o$ cannot have the same prediction power as $R_c$, 
leading to a contradiction. Hence, $R_o$ is the minimum rule set with the 
same prediction power as the complete class association rule set. 

The optimal class association rule set has the same prediction 
power as the complete class association rule set because it contains all strong 
rules. Even though the complete class association rule set is usually much larger than 
the optimal class association rule set, it contains many weak rules that cannot 
provide more prediction power than their strong rules do. In other words, the 
optimal class association rule set is fully equivalent to the complete class 
association rule set in terms of prediction power. Thus, it is not necessary to keep 
a rule set that is larger than the optimal class association rule set, and we can 
find all predictive association rules from the optimal class association rule set. 

In the next section, we will present an efficient algorithm to mine the optimal 
class association rule set. 

3 Mining Algorithm 

A straightforward method to obtain the optimal class association rule set $R_o$ 
is to first generate the complete class association rule set $R_c$ and then prune 
all weak rules from it. Clearly, mining the complete class association rule set $R_c$ 
is very expensive and almost impossible when the minimum support is low. In 
this section, we present an efficient algorithm that can find the optimal class 
association rule set directly without generating $R_c$ first. 

Most efficient association rule mining algorithms use the upward closure 
property of the infrequency of patterns: if a pattern is infrequent, so are all its super 
patterns. If we can find a similar property for weak rules, then we can avoid 
generating many weak rules, hence making the algorithm more efficient. In the 
following we discuss an upward closure property for pruning weak rules. 

Let us begin with some definitions. We say that $r_1$ is a general rule of $r_2$, or 
$r_2$ is a specific rule of $r_1$, if $cond(r_1) \subset cond(r_2) \wedge cons(r_1) = cons(r_2)$. We define 
the covered set of rule r to be the set of records containing the antecedent of the 
rule, denoted by cov(r). Similarly, the covered set of a pattern A is defined to be 
the set of records containing the pattern, denoted by cov(A). It is clear that the 
covered set of a specific rule is a subset of the covered set of its general rule. 

Suppose that X and Y are two patterns in database D, and XY is the 
abbreviation of $X \cup Y$. We have the following two properties of covered sets. 

Property 1. $cov(X) \subseteq cov(Y)$ iff sup(X) = sup(XY). 

Property 2. $cov(X) \subseteq cov(Y)$ if $Y \subseteq X$. 

Now we discuss an upward closure property for pruning weak rules. Given 
database D and a target value c of target attribute C, we have 

Lemma 1. If $cov(X\neg c) \subseteq cov(Y\neg c)$, then $XY \Rightarrow c$ and all its specific rules 
must be weak. 

Proof. We rewrite the confidence of rule $A \Rightarrow c$ as $\frac{sup(Ac)}{sup(Ac) + sup(A\neg c)}$. We know 
that the function $f(u) = \frac{u}{u+v}$ is monotonically increasing in u when v is a 
constant. Noticing that $sup(Xc) \ge sup(XYc)$ and $sup(X\neg c) = sup(XY\neg c)$, we have 
$conf(X \Rightarrow c) \ge conf(XY \Rightarrow c)$. Using the relation $|cov(X \Rightarrow c)| \ge |cov(XY \Rightarrow c)|$, 
we have $acc(X \Rightarrow c) \ge acc(XY \Rightarrow c)$. As a result, $X \Rightarrow c > XY \Rightarrow c$. 

Since $cov(XZ\neg c) \subseteq cov(YZ\neg c)$ for all Z, we have $XZ \Rightarrow c > XYZ \Rightarrow c$ for 
all Z. 

Consequently, $XY \Rightarrow c$ and all its specific rules are weak. 

We can read the lemma as follows: adding a pattern to the conditions of 
a rule is meant to make the rule more precise (with fewer negative examples), and we 
shall omit a pattern that fails to do so. 

Corollary 1. If $cov(X) \subseteq cov(Y)$, then $XY \Rightarrow c$ and all its specific rules must 
be weak for all $c \in C$. 

We can understand the corollary in the following way: we cannot combine a 
super concept with a sub concept as the antecedent of a rule to make the rule 
more precise. 

Lemma 1 and Corollary 1 are very helpful for searching strong rules, since 
we can remove a set of weak rules as soon as we find that one satisfies the above 
Lemma and Corollary. Hence, the searching space for strong rules is reduced. 

To find those patterns satisfying Lemma 1 and Corollary 1 efficiently, we 
need to use Properties 1 and 2. Property 1 enables us to find the subset relation 
by comparing the supports of two patterns. This is very convenient and easy to 
implement since we always have support information. By Property 2, the 
covered set of a pattern (e.g. X) is always a subset of the covered set of 
each of its subpatterns of cardinality |X| − 1. So, we only need to compare the support of 
a k-pattern with that of its (k − 1)-subpatterns in order to decide whether the 
k-pattern should be removed. 

Since both the Lemma and the Corollary state an upward closure property of weak 
rules, we can have an efficient algorithm to prune them. 

Basic Idea of the Proposed Algorithm 

We use a level-wise algorithm to mine the optimal class association rule set. 
We search for strong rules level by level, from antecedents that are 1-patterns to 
antecedents that are k-patterns. At each level, we select strong rules and prune weak 
rules. The efficiency of the proposed algorithm rests on the fact that a number of weak 
rules are removed as soon as the Lemma or the Corollary is found to be satisfied. 
Hence, the search space is reduced after each level's pruning. The number of 
passes over the database is bounded by the length of the longest rule in the 
optimal class association rule set. 

Storage Structure 

A prefix tree, or enumerate tree 0 is used as the storage structure. A prefix tree 
is an ordered and unbalanced tree, where each node is labeled by an element in 
a sorted base set, B, and represents a set S ⊆ B containing all labels from the 
root to the node. Since set S is unique in a prefix tree, we can use it as the 
identity of a node. 

We use an extended prefix tree, named the candidate tree, in our algorithm. The 
base set here contains all attribute and value pairs, sorted in the order of 
their first references. A node in a candidate tree stores a pattern A that is 
the identity of the node, a potential target set Z, and a set Q of sets of 
possible attribute and value pairs. Pattern A is the antecedent of a possible 
rule. The potential target set Z is a set of values of the target attribute 
that may be consequences of A. For each target (e.g. z_j) in Z, there is a set 
of possible attribute and value pairs that may be conjoined with A to form more 
accurate rules, Q_j ∈ Q. 
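The node layout described above can be sketched as a small Python data class. This is only an illustration under assumptions: the class name, field names, and the `make_node` helper are hypothetical and do not come from the paper, and the attribute values in the example are invented.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateNode:
    """One node of the candidate tree (all names here are illustrative)."""
    pattern: frozenset      # A: antecedent, a set of (attribute, value) pairs
    targets: set            # Z: target values that may still be consequences of A
    extensions: dict        # Q: maps each target z_j in Z to its pair set Q_j
    children: list = field(default_factory=list)


def make_node(pattern, targets, all_pairs):
    """Create a node in which every target may still be extended by any other pair."""
    pattern = frozenset(pattern)
    return CandidateNode(
        pattern=pattern,
        targets=set(targets),
        extensions={z: set(all_pairs) - pattern for z in targets},
    )


# Hypothetical attribute/value pairs and target values.
pairs = [("outlook", "sunny"), ("windy", "no")]
node = make_node([pairs[0]], {"play", "dont_play"}, pairs)
```

Keeping one pair set per target is what lets a single tree serve all consequences at once, as the text notes.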

Our algorithm is given as follows. One distinction between this algorithm 
and other prefix tree based algorithms is that our algorithm finds all class 
association rules with respect to all consequences from one candidate tree rather 
than many candidate trees. 

Algorithm: Optimal Class Association Rule Set Miner 

Input: Database D with specified target attribute C, minimum support σ 
and minimum confidence ψ. 

Output: Optimal class association rule set R. 

Set optimal class association rule set R = ∅ 
Count support of 1-patterns 
Initiate candidate tree T 
Select strong rules from T and include them in R 
Generate new candidates as leaves of T 
While (new candidate set is non-empty) 
    Count support of the new candidates 
    Prune the new candidate set 
    Select strong rules from T and include them in R 
    Generate new candidates as leaves of T 
Return rule set R 

In the following, we present and explain two unique functions in the proposed 
algorithm. 

Function: Candidate Generating 

This function generates candidates for strong rules. Let n_i denote a node 
of the candidate tree, A_i be the pattern of node n_i, Z(A_i) be the potential 
target set of A_i, and Q_q(A_i) be a set of potential attribute value pairs of 
A_i with respect to target z_q. We use P^p(A_k) to denote the set of all 
p-subsets of A_k. 

for each node at the p-th layer 
    for each pair of sibling nodes n_i and n_j (n_j is after n_i) 
        generate a new candidate n_k as a son of n_i such that    // combining 
            A_k = A_i ∪ A_j 
            Z(A_k) = Z(A_i) ∩ Z(A_j) 
            Q_q(A_k) = Q_q(A_i) ∩ Q_q(A_j) for all z_q ∈ Z(A_k) 
        for each z ∈ Z(A_k)    // testing 
            if ∃ A ∈ P^p(A_k) such that sup(A ∪ z) < σ 
            then Z(A_k) = Z(A_k) − z 
        if Z(A_k) = ∅ then remove node n_k 

We generate the (p + 1)-layer candidates from the p-th layer in the candidate 
tree. First, we combine a pair of sibling nodes and insert their combination as a 
new node in the next layer. We initiate the new node with the union of the two 
nodes. Next, if any of its p-subpatterns cannot get enough support with any of 
the possible targets (consequences), then we remove that target from the target 
set. When there is no possible target left, we remove the new candidate. 
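The combining and testing steps can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: nodes are plain dictionaries, the `rule_support` lookup table and the function name `combine` are assumptions made for illustration, and the per-target sets Q_q are left out for brevity.

```python
from itertools import combinations


def combine(node_i, node_j, rule_support, min_support):
    """Combine two sibling nodes into a (p+1)-layer candidate, or return None.

    node_i, node_j: dicts with 'pattern' (frozenset) and 'targets' (set).
    rule_support:   hypothetical lookup mapping (pattern, target) to the
                    support of pattern together with that target, for the
                    p-patterns counted so far.
    """
    pattern = node_i["pattern"] | node_j["pattern"]      # A_k = A_i ∪ A_j
    targets = node_i["targets"] & node_j["targets"]      # Z(A_k) = Z(A_i) ∩ Z(A_j)
    p = len(node_i["pattern"])
    for z in list(targets):
        # Testing: every p-subpattern of A_k must be frequent together with z.
        for sub in combinations(sorted(pattern), p):
            if rule_support.get((frozenset(sub), z), 0) < min_support:
                targets.discard(z)
                break
    if not targets:
        return None                                      # no target left: drop candidate
    return {"pattern": pattern, "targets": targets}


# Invented supports for two 1-patterns and two target values.
rule_support = {
    (frozenset({"a"}), "yes"): 0.30,
    (frozenset({"b"}), "yes"): 0.20,
    (frozenset({"a"}), "no"): 0.05,
    (frozenset({"b"}), "no"): 0.40,
}
n_a = {"pattern": frozenset({"a"}), "targets": {"yes", "no"}}
n_b = {"pattern": frozenset({"b"}), "targets": {"yes", "no"}}
candidate = combine(n_a, n_b, rule_support, min_support=0.1)
```

In this made-up example the target "no" is dropped because the subrule with antecedent {a} falls below the minimum support, so only "yes" remains a possible consequence of {a, b}.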

Function: Pruning 

This function prunes weak rules and infrequent candidates in the (p + 1)-th 
layer of the candidate tree. Let T_{p+1} be the (p + 1)-th layer of the 
candidate tree. 

for each n_i ∈ T_{p+1} 
    for each A ∈ P^p(A_i)    // A is a p-subpattern of A_i 
        if sup(A) = sup(A_i) then remove node n_i    // Corollary 1 
        else for each z_j ∈ Z(A_i) 
            if sup(A_i ∪ z_j) < σ then Z(A_i) = Z(A_i) − z_j 
                // minimum support requirement 
            else if sup(A ∪ ¬z_j) = sup(A_i ∪ ¬z_j) then Z(A_i) = Z(A_i) − z_j 
                // Lemma 1 
    if Z(A_i) = ∅ then remove node n_i 

This is the most important part of the algorithm, as it dominates the 
efficiency of the algorithm. We prune a leaf on two grounds: the frequent rule 
requirement and the strong rule requirement. Let us consider a candidate n_i in 
the (p + 1)-th layer of the tree. To test whether Corollary 1 applies, we 
compare the support of pattern A_i stored in the leaf with the support of its 
subpatterns by Property 1. There may be many such subpatterns when the size of 
A_i is large. However, we only need to compare it with its p-subpatterns because 
of the upward closure property. Hence, the number of such comparisons is bounded 
by p + 1. Once we find that the support of A_i equals the support of any of its 
p-subpatterns, we remove the leaf from the candidate tree, so none of its super 
patterns will be generated in deeper layers. In this way, the number of removed 
weak rules may increase at an exponential rate. Lemma 1 is tested in a similar 
way, but with respect to a particular target. That is, we only remove a target 
from the potential target set in the leaf. Pruning infrequent patterns is the 
same as in other association rule mining algorithms. In our experiments, 
we will show the efficiency of weak rule pruning in dense databases. 
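The pruning function can be sketched in Python as below. The sketch rests on several assumptions that are not stated in the text: supports are absolute counts, every record has exactly one target value (so that sup(A ∪ ¬z) can be computed as sup(A) − sup(A ∪ z)), and the dictionaries `sup` and `rule_sup` are hypothetical lookup tables filled by an earlier counting pass.

```python
from itertools import combinations


def prune(node, sup, rule_sup, min_support):
    """Return False if the node must be removed; otherwise shrink its target set.

    sup[A]            support count of pattern A
    rule_sup[(A, z)]  support count of A together with target z
    """
    a_i = node["pattern"]
    p = len(a_i) - 1
    for sub in combinations(sorted(a_i), p):       # each p-subpattern A of A_i
        a = frozenset(sub)
        if sup[a] == sup[a_i]:                      # Corollary 1
            return False
        for z in list(node["targets"]):
            if rule_sup[(a_i, z)] < min_support:    # minimum support requirement
                node["targets"].discard(z)
            # Lemma 1, with sup(A ∪ ¬z) taken as sup(A) − sup(A ∪ z)
            elif sup[a] - rule_sup[(a, z)] == sup[a_i] - rule_sup[(a_i, z)]:
                node["targets"].discard(z)
    return bool(node["targets"])                    # empty target set: remove node


# Invented counts for patterns {a}, {b}, {a, b} and targets "y" and "n".
A, B, AB = frozenset("a"), frozenset("b"), frozenset("ab")
sup = {A: 10, B: 8, AB: 6}
rule_sup = {(A, "y"): 6, (A, "n"): 4, (B, "y"): 5, (B, "n"): 3,
            (AB, "y"): 5, (AB, "n"): 1}
node = {"pattern": AB, "targets": {"y", "n"}}
keep = prune(node, sup, rule_sup, min_support=2)
```

Here the node survives with the single target "y", because the rule for "n" is infrequent. Had sup({a}) been equal to sup({a, b}), the whole node would have been removed under Corollary 1.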

4 Experiment 

We have implemented the proposed algorithms and evaluated them on 6 real 
world databases from UCI ML Repository |SI. For those databases having 
continuous attributes, we use Discretizer in El to discretize them. 

We have mined the complete class association rule sets and the optimal class 
association rule sets of all test databases with a minimum confidence of 0.5 
and a minimum support of 0.1. Here support is specified as local support, 
defined as the ratio of the support of a rule to the support of the rule's 
consequence, since the significance of a rule depends largely on the proportion 
of occurrences of its consequence that it accounts for. We generate the complete 
class association rule set by the same algorithm without weak rule pruning and 
strong rule selecting. We restrict the maximum layer of candidate trees to 4, 
based on the observation that overly specific rules (with many conditions) 
usually have very limited prediction power in practice. In fact, the proposed 
algorithm performs more efficiently, in comparison with mining the complete rule 
set, when there is no such restriction, as is clear from the second part of our 
experiment. We impose the restriction in order to present competitive results, 
since a rule length constraint is an effective way to avoid combinatorial explosion. 
Similar constraints have been used in practice, for example, m restricts the 
maximum size of the found rule set. 
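The local support used in this experiment can be illustrated with a short Python sketch. The function name and the counts are made up for illustration; only the ratio itself comes from the definition above.

```python
def local_support(rule_count, consequence_count):
    """Support of a rule divided by the support of its consequence."""
    return rule_count / consequence_count


# Hypothetical counts: a rule matching 30 of the 200 records of its class,
# in a database of 1000 records.
local = local_support(30, 200)
global_support = 30 / 1000
```

With these invented numbers the rule passes the 0.1 threshold on local support (0.15) even though its support relative to the whole database is only 0.03, which is why local support is kinder to rules about small classes.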

The comparisons of rule set size and time to generate between the complete 
class association rule set and optimal class association rule set are listed in Figure 
1. It is easy to see that the size of an optimal class association rule set is much 
smaller than that of the corresponding complete rule set, on the average less than 
of that. Since the optimal class association rule set has the same prediction 
power as the complete class association rule set, this rule set size reduction 
is very impressive. Similarly, the time for generating rules is much shorter as well. 
We have obtained more than | reduction of mining time on average. Moreover, 
using a smaller optimal class association rule set instead of a larger complete class 
association rule set as the input for finding predictive association rules, we will 
gain further efficiency improvements in other data mining tasks too. 



[Figure 1 panels: "Comparison of rule size" (ratio of the size of the strong rule 
set to the size of the complete rule set) and "Comparison of generating time" 
(ratio of the generating time of the strong rule set to the generating time of 
the complete rule set).] 

Fig. 1. Overall comparisons of rule size and generating time between Ro and Rc (in 
the ratio of Ro to Rc) 



The core of our proposed algorithm is to prune weak rules. To demonstrate 
the efficiency of the pruning stated in Lemma 1 and Corollary 1 on dense 
databases, we plot the number of nodes in each layer of the candidate trees of 
two databases in Figure 2. In this experiment, we lift the restriction on the 
maximum number of layers. We can see that the number of tree nodes explodes at a 
sharp exponential rate without weak rule pruning. In contrast, with weak rule 
pruning the number of tree nodes increases slowly, reaches a low maximum 
quickly, and then decreases gradually. By the time a pruned tree (with weak rule 
pruning) stops growing, its corresponding unpruned tree has only just passed its 
maximum. In the deep tree levels, beyond level 4 in our case, more than 99% of 
the nodes are pruned. This shows how much redundancy we have eliminated. In our 
experiment, more than 95% of the time is spent on such redundant computation 
when there is no maximum layer restriction. Considering how much time it would 
take to compute strong rules after obtaining all class association rules, we 
can see how effective our proposed weak rule pruning criterion is. Besides, 
from this detailed illustration of candidate tree growth without length 
restriction, we can understand why the proposed algorithm performs more 
efficiently, in comparison with mining the complete class association rule 
sets, when there is no restriction on the maximum number of layers. 





Fig. 2. Comparison of the number of candidates before and after weak rule pruning 



5 Conclusion 

In this paper, we studied the important problem of efficiently mining predictive 
association rules. We defined the optimal class association rule set, which pre- 
serves all prediction power of the complete class association rule set and hence 
can be used as a replacement of the complete class association rule set for finding 
predictive association rules. We developed a criterion to prune weak rules before 
they are actually generated, and presented an efficient algorithm to mine the 
optimal class association rule set. Our algorithm avoids redundant computation 
required in mining the complete class association rule set, and hence improves 
efficiency of the mining process significantly. We implemented the proposed algo- 
rithm and evaluated it on some real world databases. Our experimental results 
show that the optimal class association rule set has a much smaller size and 
requires much less time to generate than the complete class association rule set. 
It was also shown that the proposed criterion is very effective for pruning weak 
rules in dense databases. 

Abstract. The generation of frequent patterns (or frequent itemsets) has been 
studied in various areas of data mining. Most of the studies take the 
Apriori-based generation-and-test approach, which is computationally costly in 
the generation of candidate frequent patterns. Methods like frequent pattern trees 
have been utilized to avoid candidate set generation, but they work with more 
complicated data structures. In this paper, we propose another approach to 
mining frequent patterns without candidate generation. Our approach uses a 
simple linear list called Frequent Pattern List (FPL). By performing simple 
operations on FPLs, we can discover frequent patterns easily. Two algorithms, 
FPL-Construction and FPL-Mining, are proposed to construct the FPL and 
generate frequent patterns from the FPL, respectively. 



1 Introduction 

Mining frequent patterns finds a wide range of applications in data mining. Examples 
include mining association rules [1], correlations [4], sequential patterns [2], and so 
on. Agrawal and Srikant [1] pioneered the research by proposing the Apriori 
algorithm, and there were many improvements on their original work [3, 5, 8, 9]. 
These methods adopted the generation-and-test approach. That is, they iteratively 
generate the set of candidate frequent patterns of length (k+1) from the set of frequent 
patterns of length k, and then check their support counts in the database. However, 
there are two fundamental drawbacks [6, 7] with the Apriori-like generation-and-test 
approach. First, the generation of a huge number of candidate sets is costly. Second, 
the repeated scanning of the database and the testing of candidates by pattern 
matching is time consuming. 

Han, Pei, and Yin invented a novel data structure to mine frequent patterns without 
candidate generation: the frequent pattern tree (FP-tree) [6], which is an extension of 
the prefix-tree structure. The transactions in the database are encoded into the FP-tree in such 
a way that each transaction corresponds to a path from the root to a transaction node 
in the tree. Based on the FP-tree, along with the associated header table, they develop 
a pattern fragment growth mining method to perform mining tasks recursively with 
such a tree. 

Although novel and efficient when compared with the Apriori-like methods, the 
FP-tree approach still leaves room for improvement. First, for the same frequent item, 
there are duplicated tree nodes on different branches of the tree. Second, the FP-tree 
structure is rather complicated, and the recursive construction of conditional FP-trees 
is a nontrivial task. 

In this paper, we present another approach to mining frequent patterns without 
candidate generation: using a simpler and more straightforward structure called the 
frequent pattern list (FPL). The FPL has the following features: (1) No duplication 
for the same frequent item. (2) The data structure is simple: linear lists, which can be 
implemented by dynamic arrays. (3) The transaction database can be partitioned 
neatly when mining frequent patterns, resulting in easier management of main 
memory. 

The remainder of the paper is organized as follows. Section 2 describes the 
construction of the FPL. Section 3 details the algorithm for mining frequent patterns 
based on the FPL. Section 4 compares our approach with related work. Section 
5 gives the conclusion. 



2 Frequent Pattern List Construction (FPL-Construction) 

The construction of the FPL is achieved by algorithm FPL-Construction. Before 
going into the details, let’s describe the problem as follows. 



2.1 Problem Description [6, 7] 

Let I = {a1, a2, ..., am} be a set of items, and a transaction database DB = {T1, T2, 
..., Tn}, where Ti (i = 1, 2, ..., n) is a transaction which contains a set of items in I. 
The support (or frequency) of a pattern P, which is a set of items in I, is the number 
of transactions containing P in DB. A pattern P is a frequent pattern if P's support 
is larger than or equal to a predefined minimum support threshold t. 

Given a transaction database DB and a minimum support threshold t, the problem 
of finding the complete set of frequent patterns is called the frequent pattern-mining 
problem. 
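The definitions of support and frequent pattern can be checked with a few lines of Python. This sketch uses the frequent items of the paper's example database (Table 1); the names `DB`, `support`, and `is_frequent` are illustrative choices, not part of the paper.

```python
# Frequent items of each transaction in the example database (Table 1).
DB = {
    "T1": {"f", "c", "a", "m", "p"},
    "T2": {"f", "c", "a", "b", "m"},
    "T3": {"f", "b"},
    "T4": {"c", "b", "p"},
    "T5": {"f", "c", "a", "m", "p"},
}


def support(pattern, db):
    """Number of transactions in db that contain every item of pattern."""
    return sum(1 for items in db.values() if set(pattern) <= items)


def is_frequent(pattern, db, t):
    """A pattern is frequent if its support is at least the threshold t."""
    return support(pattern, db) >= t


cp_support = support({"c", "p"}, DB)   # contained in T1, T4, and T5
fb_support = support({"f", "b"}, DB)   # contained in T2 and T3
```

With a threshold of t = 3, the pattern {c, p} is frequent while {f, b} is not.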



Table 1. The example transaction database, DB. 



Transaction ID    Frequent Items 
T1                f, c, a, m, p 
T2                f, c, a, b, m 
T3                f, b 
T4                c, b, p 
T5                f, c, a, m, p 

2.2 Algorithm for Frequent Pattern List Construction 

In this section, we describe the procedures to construct the frequent pattern list. The 
example transaction database (borrowed from reference [6]) shown in table 1 is used 
for illustration. For convenience, only frequent items are shown, and the frequent 
items in each transaction are listed in the order of descending frequency. 

Algorithm 1 (FPL Construction: Frequent Pattern List construction) 

Input: a transaction database DB and a minimum support threshold t. 

Output: its frequent pattern list, FPL. 

Steps: 

{ 

1. Scan the database DB. Find the frequent items and their corresponding 
frequencies. Create a linear list of item nodes of frequent items in order of 
descending frequency, with the item labels and their frequencies (counts) stored in 
the item node. The result of step 1 is shown in Figure 1 for our example. 



Item node 1   Item node 2   Item node 3   Item node 4   Item node 5   Item node 6 
f: 4          c: 4          a: 3          b: 3          m: 3          p: 3 



Fig. 1. The frequent pattern list (FPL) after step 1. 

2. For each transaction Tx in DB, do the following: 

1) Select and sort the frequent items according to the order in the FPL. 

2) Starting from the root, traverse the FPL and compare the items in the FPL item 
nodes with the items in Tx. A bit string, called the transaction signature, is formed 
from left to right to indicate the presence or absence of frequent items in Tx as 
follows: at each item node, if there is a corresponding item in Tx, set the bit to 1; 
otherwise, set it to 0. When all the frequent items in Tx have been examined, a transaction 
node (T-node) containing the transaction signature is attached to the item node that 
corresponds to the rightmost item in Tx (i.e., the item with the least frequency in Tx). 
The resulting list, shown in Figure 2, is called the frequent pattern list (FPL). 

} 
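The two construction steps can be sketched in Python as follows. This is a minimal sketch under assumptions: signatures are kept as strings of '0' and '1' with the most frequent item on the left, the FPL is represented as a list of item nodes plus one list of signatures per node, and the `tie_order` parameter is a hypothetical addition used only to reproduce the paper's order among items with equal frequency.

```python
from collections import Counter


def build_fpl(transactions, min_support, tie_order=()):
    """Return (item_nodes, partitions) for a list of transactions.

    item_nodes: list of (item, count) in descending order of frequency.
    partitions: partitions[k] holds the signatures attached to item node k.
    """
    # Step 1: count items and keep the frequent ones in descending frequency.
    counts = Counter(item for t in transactions for item in set(t))
    rank = {item: r for r, item in enumerate(tie_order)}
    order = sorted(
        (item for item in counts if counts[item] >= min_support),
        key=lambda item: (-counts[item], rank.get(item, len(rank)), item),
    )
    # Step 2: build one signature per transaction and attach it to the node
    # of the rightmost (least frequent) item the transaction contains.
    partitions = [[] for _ in order]
    for t in transactions:
        present = set(t)
        signature = "".join("1" if item in present else "0" for item in order)
        signature = signature.rstrip("0")
        if signature:
            partitions[len(signature) - 1].append(signature)
    return [(item, counts[item]) for item in order], partitions


db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
item_nodes, partitions = build_fpl(db, 3, tie_order="fcabmp")
```

For the example database this attaches 1001 (T3) to item node b, 11111 (T2) to item node m, and 111011, 010101, 111011 (T1, T4, T5) to item node p.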
Fig. 2. The complete (global) FPL constructed from DB of Table 1. 

2.3 Properties of FPL 

From the structure of FPL, we observe the following properties: 

1. The transactions under item node k must contain item k, may contain items to 
the left of item k (items with frequencies no less than that of item k), and will not 
contain items to the right of item k (items with frequencies no more than that of item 
k). Since item k is the rightmost item contained in the transactions under node k, the 
LSBs (least significant bits) of their signatures must be 1. 

2. The transaction database can be partitioned by the FPL according to the above 
criteria. All the transactions under the same item node belong to the same partition. 

3. All the bit strings of the transactions under item node k have the same length k, 
which is also the path length from the root to item node k. 

4. Since FPL is built in descending order of item frequencies, transactions 
containing less frequent items will have longer bit strings. That is, these transactions 
should be attached to the item nodes farther away from the root. But the number of 
these transactions must be small, since the rightmost items contained in these 
transactions are less frequent items with smaller support count. 

5. Likewise, transactions containing only more frequent items will have shorter bit 
strings. That is, these transactions should be attached to the item nodes closer to the 
root. Although the number of these transactions must be large, they will not consume 
much memory because of the short lengths of their bit strings. 



3 Mining Frequent Patterns with the FPL 

After constructing the FPL, we are able to discover frequent patterns by performing 
simple operations on it. For this purpose, we devise algorithm FPL Mining, which 
can generate frequent patterns in a very straightforward way. An example is also 
given to illustrate the complete mining process. 



3.1 Basic Operations 

Frequent patterns can be discovered from the FPL by performing simple operations 
on the transaction signatures associated with its rightmost item node, which 
corresponds to the item with smallest frequency. These operations are described as 
follows: 

Bit counting: for each bit position, count the number of 1-bits. 

Signature trimming: since the last bit (LSB) of each signature must be 1 (refer to 
section 2.3, property 1), it can be removed without losing information. After this, the 
trailing 0-bits of the signature are also removed. 

Signature migration: from the least significant 1-bit of the trimmed signature, find 
the corresponding item node, and migrate this trimmed signature to that item node in 
the FPL. 
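The three basic operations can be sketched in Python on string signatures. The representation is an assumption: each signature is a string of '0' and '1' whose last character is the LSB, and `partitions[k]` lists the signatures attached to item node k. The function names are illustrative.

```python
def bit_counts(signatures):
    """Bit counting: number of 1-bits at each position of equal-length signatures."""
    return [sum(s[k] == "1" for s in signatures) for k in range(len(signatures[0]))]


def trim(signature):
    """Signature trimming: drop the last bit (always 1), then any trailing 0-bits."""
    return signature[:-1].rstrip("0")


def migrate(partitions):
    """Signature migration: remove the last item node and re-attach its
    trimmed signatures to the node of their new least significant 1-bit."""
    for signature in partitions.pop():
        trimmed = trim(signature)
        if trimmed:                        # signatures with no 1-bit left are dropped
            partitions[len(trimmed) - 1].append(trimmed)
    return partitions


# The example FPL with item nodes f, c, a, b, m, p (in this order).
partitions = [[], [], [], ["1001"], ["11111"], ["111011", "010101", "111011"]]
counts_under_p = bit_counts(partitions[-1])
after_migration = migrate([list(p) for p in partitions])
```

On the example, bit counting under item node p gives the counts 2, 3, 2, 1, 2, 3 for f, c, a, b, m, p, and migration moves 11101 (twice) to item node m and 0101 to item node b.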

3.2 Algorithm for Mining Frequent Patterns with FPL 

In this section, we give the algorithm for mining frequent patterns on the FPL. The 
example in section 2 is used here for explanation. 

Algorithm 2 (FPL mining: mining frequent patterns with FPL) 

Input: FPL constructed based on Algorithm 1, using DB and a minimum support 
threshold t. 

Output: the complete set of frequent patterns. 

Steps: Procedure FP-mining (FPL, t) 

{ 

1). Bit counting 

Visit the item node at the end of the FPL. Generate a pattern whose label and count 
are the same as those stored in this item node. For all the transaction signatures under 
this item node, conduct bit counting on all bit positions other than the least 
significant bit (LSB) position, which corresponds to the last item node itself. Ignore 
the bit positions whose bit counts are below the minimum support threshold. Figure 3-1 
shows the details for the example, assuming the minimum support threshold is 3. 

Fig. 3-1. The result of bit counting on the last partition (item node p) for Figure 2. Pattern 
generated: (p: 3). Remaining bit: bit 4 (item c). 

2). Output more patterns directly or by recursive calls 

For the remaining bits: 

If all their bit counts are equal to the count in the end item node 
Then Produce all combinations of the items corresponding to these remaining 
bits. Generate patterns by concatenating each of the combinations with the item of 
the last item node, with pattern counts being equal to the count in the end item node; 
Else 

(i) For each transaction signature associated with the end item node: 

The least significant bit (which is a 1-bit) is removed. The signature can be 
discarded if it contains no more 1-bits. 

(ii) All the remaining signatures, with their LSBs removed, are then used as input to 
Algorithm 1 to construct a sub-FPL, FPL_sub (an FPL one order lower than its parent). A 
recursive call to FP-mining (FPL_sub, t) is then made to generate frequent patterns. 
The patterns generated from this recursive call must be concatenated with the last 
item of the parent FPL to form the final frequent patterns. The count of each final 
frequent pattern is the same as the count of the pattern produced by the recursive call. 

For our example, there is only one remaining bit, and we get pattern (cp: 3). 

No recursive call is made in this case. 

3). Signature trimming and migration 

For all the transaction signatures associated with the end item node: 

Bit 0 (the LSB corresponding to the last item in FPL) is trimmed. 

Find the next nonzero least significant bit, and migrate the signature to the item node 
corresponding to this bit. All the trailing zero bits can be trimmed. The last item node 
of the FPL is removed. For our example, the resulting FPL, FPL_trimmed, is shown in Figure 3-2. 

Fig. 3-2. The resulting FPL after signature trimming and migration from the FPL in Figure 1. 
Item node p (item node 6) is removed. 

4). If there is only one node in the trimmed FPL (that is, the starting node becomes the 
final remaining node in the FPL) 

Then the mining process stops by generating the final pattern whose label and count 
are the same as those stored in this item node; 

Else go to step 1 with the trimmed FPL as input. 

} 
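The whole procedure can be sketched as one recursive Python function. This is a sketch under assumptions rather than the authors' implementation: signatures are strings of '0' and '1' with the LSB as the last character, the input FPL is assumed to contain only frequent items, and the sub-FPL keeps the parent's item order instead of re-sorting by frequency as a fresh run of Algorithm 1 would. The function name `mine` is illustrative.

```python
from itertools import combinations


def mine(items, partitions, min_support):
    """Return {frozenset(pattern): count} for an FPL.

    items:      item labels in FPL order (most frequent first).
    partitions: partitions[k] lists the signatures attached to item node k.
    """
    items = list(items)
    partitions = [list(p) for p in partitions]     # work on copies
    patterns = {}
    while items:
        last, sigs = items[-1], partitions[-1]
        count = len(sigs)              # all transactions containing `last` end here
        patterns[frozenset([last])] = count
        # 1) Bit counting on every position except the LSB.
        ones = [sum(s[k] == "1" for s in sigs) for k in range(len(items) - 1)]
        kept = [k for k, c in enumerate(ones) if c >= min_support]
        # 2) Output more patterns directly or by a recursive call.
        if kept and all(ones[k] == count for k in kept):
            for r in range(1, len(kept) + 1):
                for combo in combinations(kept, r):
                    patterns[frozenset(items[k] for k in combo) | {last}] = count
        elif kept:
            sub_items = [items[k] for k in kept]
            sub_parts = [[] for _ in kept]
            for s in sigs:
                sub = "".join(s[k] for k in kept).rstrip("0")
                if sub:
                    sub_parts[len(sub) - 1].append(sub)
            for pattern, c in mine(sub_items, sub_parts, min_support).items():
                patterns[pattern | {last}] = c
        # 3) Signature trimming and migration; the last item node is removed.
        items.pop()
        for s in partitions.pop():
            trimmed = s[:-1].rstrip("0")
            if trimmed:
                partitions[len(trimmed) - 1].append(trimmed)
    return patterns


# The example FPL with item nodes f, c, a, b, m, p and threshold 3.
fpl = [[], [], [], ["1001"], ["11111"], ["111011", "010101", "111011"]]
found = mine("fcabmp", fpl, 3)
```

The loop ends naturally when the last remaining node has been processed, which corresponds to step 4. On the example the sketch yields the same patterns as the walk-through in the figures, including (cp: 3), (fcam: 3), (c: 4), and (fc: 3) from the recursive call.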



3.3. Details of the Mining Process 



Remaining details are given in Figures 3-3 to 3-11. The mining process repeats the 
cycle of bit counting, signature trimming, and migration. Recursive calls are made if 
necessary. 

Fig. 3-3. The result of bit counting for the FPL in Figure 3-2. All the surviving bits (for items f, c, 
a) have the same count: 3. Patterns generated: (m: 3), (am: 3), (fm: 3), (cm: 3), (cam: 3), (fam: 3), 
(fcm: 3), (fcam: 3). 

Fig. 3-4. The result of signature trimming and migration for Figure 3-2. Item node m is 
removed. 

Fig. 3-5. The result of bit counting for Figure 3-4. All the remaining bits (for items f, c, a) 
have counts less than 3. Patterns generated: only (b: 3). 

Fig. 3-6. The result of signature trimming and migration for Figure 3-4. Item node b is removed. 

Fig. 3-7. The result of bit counting for Figure 3-6. All the remaining bits (for items f, c) have 
the same count: 3. Patterns generated: (a: 3), (ca: 3), (fa: 3), (fca: 3). 



Item node 1   Item node 2 
f: 4          c: 4 
T3: 1         T4: 01 
              T1: 11 
              T5: 11 
              T2: 11 



Fig. 3-8. The result of signature trimming and migration for the FPL in Figure 3-6. Item node a 
is removed. 

Fig. 3-9. The result of bit counting for Figure 3-8. The remaining bit (item f) has a count of 3, 
but the last item node (item c) has a count of 4. Pattern generated: (c: 4). T4 is 
discarded since, with its LSB removed, it contains no more 1-bits. Signatures T1, T5, and T2, 
with their LSBs removed, are used to construct a sub-FPL. A recursive call is made using this 
sub-FPL as input to find frequent patterns. 

Fig. 3-10. The resulting sub-FPL from the recursive call of Figure 3-9. Pattern generated: (f: 3), 
which must be concatenated with item c, the last item of the parent FPL. The final pattern 
generated: (fc: 3). 

Fig. 3-11. The result of signature trimming and migration for Figure 3-8. Only one pattern is 
generated: (f: 4). The mining process stops. 



4 Discussions 

In this section we compare our FPL approach with other previous works: the Apriori- 
like algorithms and the FP-tree algorithms. 

As for the issue of recursive structure of the algorithms, both FP-tree and FPL are 
recursive by nature. The Apriori-like algorithms, although iterative in appearance, still 
rely on recursive structures when checking the candidate frequent patterns. 

Regarding the order of generating frequent patterns, the Apriori-like algorithms use 
“size order”; that is, itemsets of smaller sizes are generated first, and then itemsets of 
larger sizes are derived from them. This makes the frequent pattern-mining 
problem a holistic one: to check the validity of a frequent pattern, the entire database 
has to be scanned. The database cannot be segmented for the mining task. The FP-tree 
and FPL approaches, on the other hand, use the “frequency order.” Patterns containing 
less frequent items are generated first, and then patterns with only more frequent 
items are generated. Using the well-defined structures of FP-tree and FPL, this mining 
order can be realized by partitioning the database, resulting in a divide-and-conquer 
methodology. For FPL, only the transaction signatures under the rightmost item node 
are required for mining frequent patterns at each mining stage. 

When the database is huge and complicated, our algorithm allows the database to be 
partitioned (see property 2 in section 2.3). Moreover, since our FPL algorithm does 
not generate candidate patterns, the computation time scales linearly with the size of 
the database, as does the FP-tree approach proposed in [6], rather than exponentially 
as with Apriori-like algorithms. 



5 Conclusions 

In this paper we proposed a simple structure, the frequent pattern list (FPL), for 
storing information about frequent patterns, discussed the properties of FPL, and 
developed simple operations on FPL for mining frequent patterns. 

Features of FPL are: (1) No duplication for the same frequent item. (2) The data 
structure is simple: linear lists. (3) The operations are simple: bit operations (bit 
counting and signature trimming), and signature removal and appending (called 
signature migration). Therefore, dynamic arrays can be used to implement the FPL 
structure. (4) The transaction database can be partitioned neatly when mining 
frequent patterns, resulting in easier management of main memory. 

There are several issues related to FPL-based mining. For example, more efficient 
algorithms for performance improvement should be studied. Also, the FPL approach 
can be applied to other applications like the mining of user profiles. 

Abstract. Discovering interesting associations of events is an important 
data mining task. In many real applications, the notion of association, 
which defines how events are associated, often depends on the particular 
application and user requirements. This motivates the need for a general 
framework that allows the user to specify the notion of association of 
his/her own choice. In this paper we present such a framework, called 
UDA mining (User-Defined Association Mining). The approach is 
to define a language for specifying a broad class of associations that is 
nevertheless efficient to implement. We show that (1) existing notions of 
association mining are instances of UDA mining, and (2) many new ad-hoc 
association mining tasks can be defined in the UDA mining framework. 



1 Introduction 

Interesting association patterns could occur in diverse forms. Early work has 
defined and mined associations of different notions in separate frameworks. For 
example, association rules are defined by confidence/support and are searched 
based on the Apriori pruning i correlation rules are defined by the statistics 
test and are searched based on the upward-closed property of correlation jlj; 
causal relationships are defined and searched by using CCC and CCU rules 
H3; emerging patterns are defined by the growth ratio of support jO]. With 
such a “one-framework-per-notion” paradigm, it is difficult to compare different 
notions and identify commonalities among them. More importantly, the user may 
not find such pre-determined frameworks suitable for his/her specific needs. For 
example, at one time the user likes to find all pairs < p, c > such that p is 
some above-mentioned association pattern and c is a condition under which 
p occurs; at another time the user likes to know all triples < p, c1, c2 > such 
that association pattern p occurs in the special case c1 but not in the general 
case c2; at yet another time the user wants something else. Even for this simple 
example, it is not clear how the above existing frameworks can be extended to 
such “ad-hoc” mining. The topic of this paper is to address this extendibility. 

Our approach is to propose a language in which the user himself/herself 
can define a new notion of association (rather than choose a pre-determined notion). 
In spirit, this is similar to database querying in a DBMS, in that it does not 
predict the mining tasks that the user might need to perform; it is the 
expressive power of the language that determines the class of associations 
specifiable in this approach. The key is the notion of “user-defined associations” 
and its specification language. Informally, a user-defined association has two 
components, events and their relationship. An event is a conjunction of atomic 
descriptors, called items, for transactions in the database. For example, event 
FEMALE ∧ YOUNG ∧ MANAGER is a statement about individuals, where 
items FEMALE, YOUNG, and MANAGER are atomic descriptors of 
individuals¹. A relationship is a statement about how events are associated. We 
illustrate the notion of “user-defined associations” through several examples. 

Example I. A liberate notion of association between two events X and Y 
can occur in the form that X causes Y. As pointed out by such causal rela- 
tionships, which state the nature of the relationships, cannot be derived from the 
classic association rules X ^ Y. m has considered the problem of mining causal 
relationships among 1-item events, i.e., events containing a single item. In the in- 
terrelated world, causal relationships occur more often among multi-item events 
than among 1-item events. For example, 2-item event MALE A POSTGRAD 
more likely causes HIGHNNGOME than each of the 1-item events MALE 
and POSTGRAD does. Though the concept of causal relationships remains 
unchanged for multi-item events, the search for such causal relationships turns 
out to be more challenging because it is unknown in advance which items form 
a meaningful multi-item event in a causal relationship. In our approach, such 
general causal relationships are modeled as a special case of user-defined associ- 
ations. 

Example II. The user likes to know all three events Z1, Z2, X such that 
X is more “associated” with Z1 than with Z2, where the notion of association 
between X and Zi could be any user-defined association. For example, if X = 
HIGH_INCOME is more correlated with Z1 = POSTGRAD ∧ MALE than 
with Z2 = POSTGRAD ∧ FEMALE, the user could use it as evidence of 
gender discrimination, because the same education does not give women the same 
pay as men. Again, multi-item events like POSTGRAD ∧ MALE are essential 
for discovering such associations. 

Example III. Sometimes, the user would like to know all combinations of events 
Z1, Z2, X1, ..., Xk such that the association of k events X1, ..., Xk, in whatever 
notion, has sufficiently changed when the condition changes from Z1 to Z2. For 
example, X1 = BEER and X2 = CHIPS could be sold together primarily 
during Z1 = [6PM, 9PM] ∧ WEEKDAY. Here, Z2 is implicitly taken as ∅, 
representing the most general condition. 

This list can go on, but several points have emerged and are summarized 
below. 

¹ A better term for things like FEMALE, YOUNG, and MANAGER is perhaps 
“feature” or “variable”. We shall use the term “items” to be consistent with [?]. 

1. User-defined associations. A powerful concept in user-defined association is 
that the user defines a class of associations by “composing” existing 
user-defined associations. The basic building blocks in this specification, such as 
support, confidence, correlation, conditional correlation, etc., may not be 
new and, in fact, are well understood. What is new is to provide the user 
with a mechanism for constructing a new notion of association using such 
building blocks. 

2. Unified specification and mining. A friendly system should provide a single 
framework for specifying and mining a broad class of notions of association. 
We do not expect a single framework to cover all possible notions of associ- 
ation, just as we do not expect SQL to express all possible database queries. 
What we expect is that the framework is able to cover most important and 
typical notions of association. We will elaborate on this point in Section 3. 

3. Completeness of answers. Association mining aims to find all associations 
of a specified notion. In contrast, most work in statistics and machine 
learning, e.g., model search [?] and Bayesian network learning [?], is primarily 
concerned with finding some but not all associations. To search for 
such complete answers, those approaches are too expensive for data sets with 
thousands of variables (i.e., items) as we consider here. 

4. Unspecified event space. The event space is not fixed in advance and must be 
discovered in the search of associations. This feature is different from [?], 
where only 1-item events are considered. Given thousands of items and that 
any combination of items is potentially an event, it is a non-trivial task to 
determine what items make a meaningful event in the association with other 
events. This task is further compounded by the fact that any combination 
of events is potentially an association. 

In the rest of this paper, we present a unified framework for specifying and 
mining user-defined associations. The framework must be expressive enough for 
specifying a broad class of associations. In Section 2 and Section 3, we pro- 
pose such a framework and examine its expressive power. Equally important, 
the mining algorithm must have an efficient implementation. We consider the 
implementation in Section 4. We review related work in Section 5 and conclude 
the paper in Section 6. 



2 User-Defined Association 

2.1 Definitions 

The database is a collection of transactions. Each transaction is represented by 
a set of Boolean descriptors called items that hold on the transaction. An event 
is a conjunction of items, often treated as a set of items. We do not consider 
disjunction in this paper. ∅, called the empty event, denotes the Boolean constant 
TRUE or the empty set. Given the transaction database, the support of an event 
X, denoted P(X), is the fraction of the transactions on which event X holds, 
or of which X is a subset. An event X is large if P(X) ≥ mini_sup for the 
user-specified minimum support mini_sup. Events that are not large occur too 
infrequently and therefore do not have statistical significance. The set of large 
events is downward-closed with respect to set containment ⊆: if X is large and X' is 
a subset of X, X' is also large. For events X and Y, XY is the shorthand for event 
X ∧ Y or X ∪ Y, and P(X|Y) for P(XY) / P(Y). Thus, X1, ..., Xk represents 
k events whereas X1...Xk represents one event X1 ∧ ... ∧ Xk. The notion of 
support can be extended to absence of events. For example, P(X¬Y¬Z) denotes 
the fraction of the transactions on which X holds but neither Y nor Z does. 
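
The definitions above can be made concrete with a small sketch. The transaction data, the item names, and the helper names `support`, `is_large`, and `conditional` are illustrative assumptions and do not come from the paper; events are modeled as sets of items.

```python
# Toy transaction database (hypothetical data): each transaction is the set
# of items that hold on it.
TRANSACTIONS = [
    frozenset({"MALE", "POSTGRAD", "HIGH_INCOME"}),
    frozenset({"MALE", "POSTGRAD", "HIGH_INCOME"}),
    frozenset({"FEMALE", "POSTGRAD"}),
    frozenset({"MALE"}),
    frozenset({"FEMALE", "HIGH_INCOME"}),
]


def support(event, absent=()):
    """P(event with every event in `absent` not holding).

    A transaction counts if it contains all items of `event` and does not
    fully contain any of the absent events.
    """
    event = frozenset(event)
    absent = [frozenset(a) for a in absent]
    hits = sum(
        1
        for t in TRANSACTIONS
        if event <= t and not any(a <= t for a in absent)
    )
    return hits / len(TRANSACTIONS)


def is_large(event, mini_sup):
    """An event is large if its support reaches the minimum support."""
    return support(event) >= mini_sup


def conditional(x, y):
    """P(X|Y) = P(XY) / P(Y), where XY is the union of the two item sets."""
    return support(frozenset(x) | frozenset(y)) / support(y)
```

For instance, `support(set())` is 1.0 because the empty event holds on every transaction, and any subset of a large event is again large, which is the downward-closed property used later for pruning.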

A user-defined association is written as Z1, ..., Zp → X1, ..., Xk. X1, ..., Xk 
are called subject events, whose association is of the primary concern. Z1, ..., Zp 
are called context events, which provide p different conditions for comparing 
the association of subject events. Context events are always ordered because 
the order of affecting the association is of interest. The notion of user-defined 
association Z1, ..., Zp → X1, ..., Xk is defined by the support filter and the 
strength filter defined below. 

- Support_Filter. It states that events X1, ..., Xk, Zi must occur together 
frequently: if p > 0, P(X1...XkZi) ≥ mini_sup for 1 ≤ i ≤ p; or if 
p = 0, P(X1...Xk) ≥ mini_sup. In other words, X1...XkZi, or X1...Xk 
if p = 0, is required to be a large event. If this requirement is not 
satisfied, the co-occurrence of the Xi's under condition Zi does not have 
statistical significance. This condition is called the support filter and is written as 
Support_Filter(z1, ..., zp → x1, ..., xk), where zi and xi are variables 
representing the events Zi and Xi in a user-defined association. 

- Strength_Filter. It states that events Z1, ..., Zp, X1, ..., Xk must hold the 
relationship specified by a conjunction of one or more formulas of the form 
ψi ≥ mini_stri. Each ψi measures some strength of the relationship and 
mini_stri is the threshold value on the strength. This conjunction is called 
the strength filter and is written as Strength_Filter(z1, ..., zp → x1, ..., xk), 
where zi and xi are variables representing the events Zi and Xi. 

In the above filters, variables xi and zi can be instantiated by events Xi 
and Zi, and the instantiation is represented by Support_Filter(Z1, ..., Zp → 
X1, ..., Xk) and Strength_Filter(Z1, ..., Zp → X1, ..., Xk). Observe that 
Support_Filter(Z1, ..., Zp → X1, ..., Xk) implies that each Xi and Zi is a large 
event because of the downward-closed property mentioned earlier. It remains to 
choose a language for specifying ψi, which will determine the class of 
associations specified and the efficiency of the mining algorithm. We will study this 
issue shortly. For now, we assume that such a language is chosen. As a 
convention, we use lower case letters z1, ..., zp, x1, ..., xk for event variables and use 
upper case letters Z1, ..., Zp, X1, ..., Xk for events. 

Definition 1 (The UDA specification). A user-defined association 
specification (UDA specification), written as UDA(z1, ..., zp → x1, ..., xk), k > 0 and 
p ≥ 0, has the form Strength_Filter(z1, ..., zp → x1, ..., xk) ∧ Support_Filter 
(z1, ..., zp → x1, ..., xk). (The End) 

We say that a UDA specification is symmetric if variables xi's are 
symmetric in Strength_Filter (note that variables xi's are always symmetric in 
Support_Filter); otherwise, it is asymmetric. A symmetric specification is 
desirable if the order of subject events does not matter, such as correlation. Otherwise, 
an asymmetric UDA specification is desirable. For example, an asymmetric UDA 
is that whenever events Z1, ..., Zp occur, X1 occurs but not X2. We consider 
only symmetric specifications, though the work can be extended to asymmetric 
specifications. 

Definition 2 (The UDA problem). Assume that UDA(z1, ..., zp → 
x1, ..., xk) is given for 0 < k ≤ k', where p (≥ 0) and k' (> 0) are 
specified by the user. Consider distinct events Z1, ..., Zp, X1, ..., Xk. We say that 
Z1, ..., Zp → X1, ..., Xk is a UDA if the following conditions hold: 

1. Xi ∩ Xj = ∅, i ≠ j, and 

2. Xi ∩ Zj = ∅, for all i, j, and 

3. UDA(Z1, ..., Zp → X1, ..., Xk) is true. 

k is called the size of the UDA. Z1, ..., Zp → X1, ..., Xk is minimal if for any 
proper subset {Xi1, ..., Xiq} of {X1, ..., Xk}, Z1, ..., Zp → Xi1, ..., Xiq is not 
a UDA. The UDA problem is to find all UDAs of the specified sizes 0 < k ≤ k'. 
The minimal UDA problem is to find all minimal UDAs of the specified sizes k. 
(The End) 

Several points about Definition 2 are worth noting. 

First, the number of context events, p, in a UDA Z1, ..., Zp → X1, ..., Xk is 
fixed whereas the number of subject events, k, is allowed up to a specified 
maximum size k'. This distinction comes from the different roles of these events: for 
subject events we do not know a priori how many of them may participate in an 
association, but we often examine a fixed number of conditions for each 
association. It is possible to allow the number of conditions p up to some maximum 
number, but we have not found useful applications that require this extension. 

Second, context events Zi's are not necessarily pairwise disjoint. In fact, 
it is often desirable to examine two context events Z1 and Z2 such that Z1 
is a proper superset, and thereby a specialization, of Z2. Then we could specify 
UDAs Z1, Z2 → X1, ..., Xk such that the association of Xi's holds under the 
specialized condition Z1 but not under the general condition Z2. Other useful 
syntax constraints could be the requirement on the presence or absence of some 
specified items in an event, a certain partitioning of the items for context events 
and subject events, the maximum or minimum number of items in an event, etc. 
In the same spirit, the disjointness in condition 2 can be removed to express 
certain overlapping constraints. Constraints have been exploited to prune the search 
space for mining association rules [?]. A natural generalization is to exploit 
syntax constraints for mining general UDAs. In this paper, however, we focus 
on the basic form in Definition 2. 

2.2 Examples 

In this section, we intend to achieve two goals through considering several 
examples of UDA specification: to show that disparate notions of association can 
be specified in the UDA framework, and to readily convey the basic ideas that 
underlie more complex specifications. Once these are understood, the user can 
define any notion of association of his/her own choice, in the given 
specification language. We shall focus on specifying Strength_Filter because specifying 
Support_Filter is straightforward. In all examples, lower-case letters zi and xi 
represent event variables and upper-case letters Zi and Xi represent events. 

Example 1 (Association rules). Association rules Z → X introduced in [?] can 
be specified by 

Support_Filter(z → x) : P(zx) ≥ mini_sup 
Strength_Filter(z → x) : ψ(z, x) ≥ mini_conf, 

where ψ(z, x) = P(x|z) = P(xz) / P(z) is the confidence of rule Z → X [?]. To 
factor in both “generality” and “predictivity” of rules in a single measure, the 
following ψ(z, x), called the J-measure [?], can be used: 

\[ \psi(z,x) = P(z)\left[ P(x \mid z)\log_2\frac{P(x \mid z)}{P(x)} + (1 - P(x \mid z))\log_2\frac{1 - P(x \mid z)}{1 - P(x)} \right]. \]

Here, P(z) weighs the generality of the rule and the term in the square bracket 
weighs the “discrimination power” of z on x. This example shows how easy it is 
to adopt a different definition of association in the UDA framework. (The End) 
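
As a rough illustration of how the two strength functions of Example 1 could be evaluated from supports alone, consider the sketch below. The function names `confidence` and `j_measure` and the numeric inputs are assumptions made for this example; the convention 0 · log 0 = 0 is the usual one for such measures and is also an assumption here.

```python
import math


def confidence(p_zx, p_z):
    """psi(z, x) = P(x|z) = P(zx) / P(z)."""
    return p_zx / p_z


def j_measure(p_z, p_x, p_x_given_z):
    """J-measure: P(z) * [a*log2(a/P(x)) + (1-a)*log2((1-a)/(1-P(x)))],
    where a = P(x|z)."""

    def term(p, q):
        # p * log2(p / q), using the convention that the term is 0 when p == 0.
        return 0.0 if p == 0 else p * math.log2(p / q)

    return p_z * (term(p_x_given_z, p_x) + term(1 - p_x_given_z, 1 - p_x))


# Swapping the strength function is the only change needed to move from
# confidence-based rules to J-measure-based rules.
rule_passes = confidence(0.4, 0.5) >= 0.75   # mini_conf = 0.75 (hypothetical)
score = j_measure(p_z=0.5, p_x=0.5, p_x_given_z=1.0)
```

When x is independent of z, i.e. P(x|z) = P(x), both logarithms vanish and the J-measure is 0, which matches the reading of the bracketed term as discrimination power.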



In the above specification, the most basic constructs are the supports P(Z), 
P(X), P(ZX), P(¬X), P(Z¬X). Since Support_Filter(Z → X) implies that 
each of Z, X, ZX is a large event, these supports are readily available from mining 
large events (note that P(¬X) = 1 − P(X) and P(Z¬X) = P(Z) − P(ZX)). 



Example 2 (Multiway correlation). The notion of correlation is a special case of 
UDAs without context event. In particular, events X1, ..., Xk are correlated if 
they occur together more often than expected when they are independent. This 
notion can be specified by the χ² statistic test, χ²(x1, ..., xk) ≥ χ²_α [?]. Let 
R = {x1, ¬x1} × ... × {xk, ¬xk} and r = r1...rk ∈ R. Let E(r) = N * P(r1) * 
... * P(rk), where N is the total number of transactions. χ²(x1, ..., xk) is defined 
by: 

\[ \chi^2(x_1,\ldots,x_k) = \sum_{r \in R} \frac{(N \cdot P(r) - E(r))^2}{E(r)} \]

The threshold value χ²_α for a user-specified significance level α, usually 5%, can 
be obtained from statistical tables for the χ² distribution. If X1, ..., Xk passes 
the test, X1, ..., Xk are correlated with probability 1 − α. The uncorrelation of 
X1, ..., Xk can be specified by Strength_Filter of the form 1/χ²(x1, ..., xk) ≥ 
1/χ²_α, where α is usually 95%. If X1, ..., Xk passes the filter, X1, ..., Xk are 
uncorrelated with probability α. (The End) 
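
The χ² statistic of Example 2 can be sketched directly from its definition by enumerating the 2^k presence/absence cells. This is a minimal illustration that scans the transactions; the function name `chi_squared` and the toy data are assumptions, and a real miner would obtain the cell probabilities from stored supports rather than rescanning.

```python
from itertools import product


def chi_squared(transactions, events):
    """Sum over all cells r of (N*P(r) - E(r))^2 / E(r), where a cell fixes
    each event as present or absent and E(r) assumes independence."""
    n = len(transactions)
    events = [frozenset(e) for e in events]
    # For each transaction, record which events hold on it.
    holds = [tuple(e <= t for e in events) for t in transactions]
    # Marginal probability P(xi) of each event.
    marginals = [sum(h[i] for h in holds) / n for i in range(len(events))]

    total = 0.0
    for cell in product([True, False], repeat=len(events)):
        observed = sum(1 for h in holds if h == cell)   # N * P(r)
        expected = float(n)                             # E(r)
        for i, present in enumerate(cell):
            expected *= marginals[i] if present else 1 - marginals[i]
        if expected > 0:   # cells that cannot occur contribute nothing
            total += (observed - expected) ** 2 / expected
    return total


independent = [frozenset("AB"), frozenset("A"), frozenset("B"), frozenset()]
correlated = [frozenset("AB"), frozenset("AB"), frozenset(), frozenset()]
stat_independent = chi_squared(independent, ["A", "B"])
stat_correlated = chi_squared(correlated, ["A", "B"])
```

The resulting statistic would then be compared with the table value χ²_α for the chosen significance level, as the text describes.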



The problem of mining correlation among single-item events was studied 
in [?]. One difference of correlation specified as UDAs is that each event Xi 
can involve multiple items, rather than a single item. One such example is 
the correlation of X1 = INTERNET and X2 = YOUNG ∧ MALE, where 
YOUNG ∧ MALE is a 2-item event. This generalization is highly desirable 
because single-item events like X1 = INTERNET and X2 = YOUNG or 
X1 = INTERNET and X2 = MALE may not be strongly correlated. It is not 
clear how the mining algorithms in [?] can be extended to multi-item events. 
A more profound difference, however, is that, as UDAs, we can model “ad-hoc” 
extensions of correlation. The subsequent examples show this point. 



Example 3 (Conditional association). In conditional association Z → 
X1, ..., Xk, subject events X1, ..., Xk are associated when conditioned on Z. 
For example, 

INT'L ∧ BUSINESS-TRIP → CEO, FIRST-CLASS 

says that X1 = CEO and X2 = FIRST-CLASS (flights) are associated for 
Z = INT'L ∧ BUSINESS-TRIP (international business trips). For example, 
if the association of X1, ..., Xk is taken as the correlation, we have conditional 
correlation defined by Strength_Filter(z → x1, ..., xk): 

\[ \frac{P(x_1 \ldots x_k \mid z)}{P(x_1 \mid z) \cdots P(x_k \mid z)} \ge \mathit{mini\_str} \qquad (1) \]

or alternatively, by the χ² statistic test χ²(x1, ..., xk) ≥ χ²_α after replacing 
P(r) and P(ri) with P(r|z) and P(ri|z). (The End) 
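
Equation (1) can be illustrated with a short sketch that restricts the database to the transactions on which the context event holds and then compares the joint frequency of the subject events with the product of their individual frequencies. The function names and the flight data below are hypothetical.

```python
def fraction(transactions, event):
    """Fraction of the given transactions on which `event` holds."""
    return sum(1 for t in transactions if event <= t) / len(transactions)


def conditional_correlation(transactions, subjects, context):
    """Left-hand side of Equation (1):
    P(x1...xk | z) / (P(x1|z) * ... * P(xk|z)).

    Assumes the support filter already holds, so the context and every
    subject event occur at least once under the context.
    """
    context = frozenset(context)
    subjects = [frozenset(s) for s in subjects]
    under_z = [t for t in transactions if context <= t]
    joint = fraction(under_z, frozenset().union(*subjects))
    independent = 1.0
    for s in subjects:
        independent *= fraction(under_z, s)
    return joint / independent


FLIGHTS = [
    frozenset({"INTL", "BUSINESS", "CEO", "FIRST"}),
    frozenset({"INTL", "BUSINESS", "CEO", "FIRST"}),
    frozenset({"INTL", "BUSINESS"}),
    frozenset({"INTL", "BUSINESS"}),
    frozenset({"CEO"}),
    frozenset({"FIRST"}),
]
ratio_in_context = conditional_correlation(
    FLIGHTS, [{"CEO"}, {"FIRST"}], {"INTL", "BUSINESS"}
)
ratio_overall = conditional_correlation(FLIGHTS, [{"CEO"}, {"FIRST"}], set())
```

Passing the empty event as the context recovers the unconditional ratio, so the same function covers the case p = 0 in this toy setting.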



Example 4 (Comparison association). In comparison association Z1, Z2 → X, 
subject event X is associated differently with context events Z1 and Z2. For 
example, 

INT'L ∧ BUSINESS-TRIP, PRIVATE-TRIP → CEO ∧ FIRST-CLASS 

says that X = CEO ∧ FIRST-CLASS is more associated with Z1 = INT'L ∧ 
BUSINESS-TRIP than with Z2 = PRIVATE-TRIP. To compare two 
associations for difference, we can compare their corresponding strengths ψj in 
Strength_Filter. In particular, suppose that UDA(→ zi, x) specifies the 
association of Zi and X, i = 1, 2. For each ψj(zi, x) in UDA(→ zi, x), Strength_Filter 
(z1, z2 → x) for the comparison association contains the formula: 

Dist(ψj(z1, x), ψj(z2, x)) ≥ mini_strj. (2) 

Here, Dist(s1, s2) measures the distance between two strengths s1 and s2. Typical 
distance measures are Dist(s1, s2) = s1/s2 or Dist(s1, s2) = s1 − s2. (The End) 



Example 5 (Emerging association). In emerging association Z1, Z2 → 
X1, ..., Xk, the association of X1, ..., Xk has changed sufficiently when the 
condition changes from Z1 to Z2. Suppose that UDA(zi → x1, ..., xk) specifies the 
conditional association Zi → X1, ..., Xk, i = 1, 2, as in Example 3. Then, for 
each strength function ψj in UDA(zi → x1, ..., xk), Strength_Filter(z1, z2 → 
x1, ..., xk) for emerging association contains the formula: 

Dist(ψj(z1, x1, ..., xk), ψj(z2, x1, ..., xk)) ≥ mini_strj. (3) 

The notion of emerging association is useful for identifying trends and changes. 
For example, the notion of emerging patterns [?] is a special case of emerging 
associations of the form Z1, Z2 → X, where Z1 and Z2 are identifiers for the two 
originating databases of the transactions. An emerging pattern Z1, Z2 → X says 
that the ratio of the support of X in the two databases identified by Z1 and Z2 
is above some specified threshold. To specify emerging patterns, we first merge 
the two databases into a single database by adding item Zi to every transaction 
coming from database i, i = 1, 2, and specify ψ(zi, x) = P(zi x) / P(zi), where zi 
are variables for Zi, and Dist(s1, s2) = s1/s2 in Equation (3). With the general 
notion of emerging association, however, we can capture a context Zi as an 
arbitrary event (not just a database identifier) and the participation of more 
than one subject event. (The End) 
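
The reduction of emerging patterns to an emerging association can be sketched as follows: tag each transaction with an identifier item for its originating database, merge, and compare the two strengths with the ratio distance. The identifier items `DB1` and `DB2`, the function names, and the sales data are hypothetical choices for this illustration.

```python
def tag(database, label):
    """Add the database identifier item to every transaction."""
    return [frozenset(t) | {label} for t in database]


def strength(transactions, z, x):
    """psi(z, x) = P(zx) / P(z), the support of x within context z."""
    n_z = sum(1 for t in transactions if z <= t)
    n_zx = sum(1 for t in transactions if (z | x) <= t)
    return n_zx / n_z


def dist_ratio(s1, s2):
    """Dist(s1, s2) = s1 / s2."""
    return s1 / s2


db1 = [{"BEER", "CHIPS"}, {"BEER", "CHIPS"}, {"BEER"}, {"MILK"}]
db2 = [{"BEER", "CHIPS"}, {"MILK"}, {"MILK"}, {"BEER"}]
merged = tag(db1, "DB1") + tag(db2, "DB2")

x = frozenset({"BEER", "CHIPS"})
growth = dist_ratio(
    strength(merged, frozenset({"DB1"}), x),
    strength(merged, frozenset({"DB2"}), x),
)
is_emerging = growth >= 1.5   # mini_str = 1.5 is an arbitrary threshold
```

Replacing the identifier items with arbitrary events, or `x` with several subject events and a conditional strength, gives the more general emerging associations described in the text.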

In Examples 3, 4, and 5 the “output” UDAs (i.e., conditional association, 
comparison association, emerging association) are defined in terms of “input” 
UDAs. These input UDAs are of the form → X1, ..., Xk in Example 3, → Zi, X 
in Example 4, and Zi → X1, ..., Xk in Example 5, which themselves can be 
defined in terms of their own input UDAs. The output UDAs can be the input 
UDAs for defining other UDAs. In general, new UDAs are defined by 
“composing” existing UDAs. It is such a composition that provides the extendibility 
for defining ad-hoc mining tasks. We further demonstrate this extendibility by 
specifying causal relationships. 

Example 6 (Causal association). Information about statistical correlation and 
uncorrelation can be used to constrain possible causal relationships. For example, 
if events A and B are uncorrelated, it is clear that there is no causal relationship 
between them. Following this line, [?] identified several rules for inferring causal 
relationships, one of which is the so-called CCC rule: if events Z, X1, X2 are 
pairwise correlated, and if X1 and X2 are uncorrelated when conditioned on Z, 
one of the following causal relationships exists: 

X1 ⇐ Z ⇒ X2,   X1 ⇒ Z ⇒ X2,   X1 ⇐ Z ⇐ X2, 

where ⇐ means “is caused by” and ⇒ means “causes”. For a detailed account 
of this rule, please refer to [?]. We can specify the condition of the CCC rule 
by Strength_Filter(z → x1, x2): 

ψ1(∅, x1, x2) ≥ mini_str1 ∧ ψ1(∅, x1, z) ≥ mini_str1 ∧ 
ψ1(∅, x2, z) ≥ mini_str1 ∧ ψ2(z, x1, x2) ≥ mini_str2. 

Here, ψ1(w, u, v) ≥ mini_str1 tests the correlation of u and v conditioned on w, 
and ψ2(w, u, v) ≥ mini_str2 tests the uncorrelation of u and v conditioned on 
w. These tests were discussed in Examples 2 and 3. (The End) 

In all the above examples, the basic constructs used by the specification are the 
supports of the form P(v), where v is a conjunction of terms Zi, ¬Zi, Xi, ¬Xi. 
This syntax of v completely defines the language for Strength_Filter because 
we make no restriction on how P(v) should be used in the specification. In the 
next section, we define the exact syntax for v. 



3 The Specification Language 

The term “language” is more concerned with what it can do than how it is 
presented. There are two considerations in choosing the language for strength 
functions ψi. First, the language should specify a wide class of associations. Second, 
the associations specified should have an efficient mining algorithm. We start 
with the efficiency consideration. To specify UDAs Z1, ..., Zp → X1, ..., Xk, 
we require each strength ψi to be above some minimum threshold, where ψi 
is a function of P(v) and v is a conjunction of Xi and Zj. The support filter 
P(X1...XkZj) ≥ mini_sup is used to constrain the number of candidate Zj and 
Xi. The support filter implies that a conjunction v consisting of any number of 
subject events Xi and zero or one context event Zj is large. Therefore, supports 
P(v) for such v are available from mining large events if we keep the support for 
each large event. 

The question is whether it is too restrictive to allow at most one Zj in each 
v. It turns out that this is a desirable property, not a restriction. In fact, each Zj serves 
as an individual context for the association of X1, ..., Xk, and there is no need 
to consider more than one Zj at a time. Another question is whether, if absences 
of events, i.e., ¬Xi and ¬Zj, are desirable in v, as in the examples in Section 
2.2, P(v) can be computed efficiently. The next theorem, which is essentially a 
variation of the well known “inclusion-exclusion” theorem, shows that such P(v) 
can be computed from supports involving no absence of events. 

Theorem 1. Let V = {V1, ..., Vq} be q events of the form Xi or Zi. Let U be 
a conjunction of events that do not occur in V. Then 

\[ P(U \neg V_1 \ldots \neg V_q) = \sum_{W \subseteq V} (-1)^{|W|} P(UW), \]

where |W| denotes the number of Vi's in W. (The End) 

For example, assume that V = {X2, Z1} and U = X1; we have P(X1¬X2¬Z1) 
= P(X1) − P(X1X2) − P(X1Z1) + P(X1X2Z1). This rewriting conveys two 
important points: (1) the right-hand side contains no absence, and (2) if X1X2Z1 
is large (as required by the support filter), the right-hand side contains only 
supports of large events and thus is computable by mining large events containing 
no absence. Based on these observations, we are now ready to define the syntax 
of v for supports P(v) that appear in a strength function. 
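
The rewriting in Theorem 1 can be checked numerically with a small sketch. It computes a support with absences in two ways: by inclusion-exclusion over absence-free supports, and by a direct scan. The function names and the toy transactions are assumptions; single items stand in for the events X1, X2, and Z1.

```python
from itertools import combinations


def support(transactions, event):
    """Absence-free support P(event)."""
    event = frozenset(event)
    return sum(1 for t in transactions if event <= t) / len(transactions)


def support_with_absences(transactions, u, absent):
    """P(U and none of the events in `absent`), computed via Theorem 1 as
    the sum over subsets W of `absent` of (-1)^|W| * P(UW)."""
    u = frozenset(u)
    absent = [frozenset(v) for v in absent]
    total = 0.0
    for size in range(len(absent) + 1):
        for w in combinations(absent, size):
            total += (-1) ** size * support(transactions, u.union(*w))
    return total


def support_direct(transactions, u, absent):
    """Same quantity by scanning transactions, used only as a cross-check."""
    u = frozenset(u)
    absent = [frozenset(v) for v in absent]
    hits = sum(
        1 for t in transactions
        if u <= t and not any(v <= t for v in absent)
    )
    return hits / len(transactions)


DATA = [
    frozenset({"X1"}),
    frozenset({"X1", "X2"}),
    frozenset({"X1", "Z1"}),
    frozenset({"X1", "X2", "Z1"}),
    frozenset({"X2"}),
    frozenset({"X1"}),
]
by_theorem = support_with_absences(DATA, {"X1"}, [{"X2"}, {"Z1"}])
by_scan = support_direct(DATA, {"X1"}, [{"X2"}, {"Z1"}])
```

Only supports of X1, X1X2, X1Z1, and X1X2Z1 are consulted by `support_with_absences`, which is why a table of large-event supports suffices.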

Definition 3 (Individual-context assumption). Let Zj be a context event 
and Xi be a subject event. v satisfies the ICA (Individual-Context Assumption) 
if v is a conjunction of zero or more terms of the form Xi and ¬Xi, and zero 
or one term of the form Zj and ¬Zj. A support P(v) satisfies the ICA if v 
satisfies the ICA. A strength function ψ satisfies the ICA if it is defined using 
only supports satisfying the ICA. A strength filter satisfies the ICA if it uses only 
strength functions satisfying the ICA. The ICA-language consists of all UDAs 
defined by the support filter in Section 2.1 and the strength filter satisfying the 
ICA. (The End) 

For example, X1X2, ¬X1X2, X1X2Z1, and X1X2¬Z1 all satisfy the ICA 
because each contains at most one term for context events, but X1X2Z1Z2 and 
X1X2Z1¬Z2 do not. We would like to point out that the ICA is a language on support 
P(v), not a language on how to use P(v) in defining a strength function ψ. 
This total freedom on using such P(v) allows the user to define new UDAs by 
composing existing UDAs in any way he/she wants, a very powerful concept 
illustrated by the examples in Section 2.2. In fact, one can verify that all the 
strength functions ψ in Section 2.2 are specified in the ICA-language. 
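
A syntactic check for the ICA is simple enough to sketch. The encoding of a conjunction v as a list of `(role, index, negated)` triples is a hypothetical representation chosen for this example, not one prescribed by the paper.

```python
def satisfies_ica(terms):
    """Return True if the conjunction has at most one context term.

    Each term is (role, index, negated), with role "x" for a subject event
    and "z" for a context event; negated marks an absence such as ¬Z1.
    """
    for role, _index, _negated in terms:
        if role not in ("x", "z"):
            raise ValueError(f"unknown role: {role!r}")
    context_terms = [t for t in terms if t[0] == "z"]
    return len(context_terms) <= 1


X1, X2 = ("x", 1, False), ("x", 2, False)
NOT_X1 = ("x", 1, True)
Z1, Z2 = ("z", 1, False), ("z", 2, False)
NOT_Z1, NOT_Z2 = ("z", 1, True), ("z", 2, True)
```

With this encoding, `satisfies_ica([X1, X2, NOT_Z1])` holds while `satisfies_ica([X1, X2, Z1, NOT_Z2])` does not, mirroring the examples in the text.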

We close this section by making an observation on the “computability” of 
the ICA-language. The support filter implies that any absence-free v satisfying 
the ICA is a large event. The rewriting by Theorem 1 preserves the ICA because 
it only eliminates absences. Consequently, the ICA-language ensures that all 
allowable supports can be computed from mining large events. This addresses 
the computational aspect of the language. 



4 Implementation 

We are interested in a unified implementation for mining UDAs. Given that any 
combination of items can be an event and any combination of events can be 
a UDA, it is only feasible to rely on effective pruning strategies to reduce the 
search space of UDAs. Due to the space limitation, we sketch only the main ideas. 
Assume that the items in an event are represented in the lexicographical order 
and that the subject events in a UDA are represented in the lexicographical order 
(we consider only symmetric UDA specifications). We consider p > 0 context 
events; the case of p = 0 is more straightforward. Our strategy is to exploit 
the constraints specified by Support_Filter and Strength_Filter as early as 
possible in the search of UDAs. The first observation is that Support_Filter 
implies that all subject events Xi and context events Zi are large. Thus, as the 
first step we find all large events, say by applying Apriori [?] or its variants. We 
assume that the mined large events are stored in a hash-tree [?] or a hash table 
so that the membership and support of large events can be checked efficiently. 

The second step is to construct UDAs using large events. For each UDA 
Z1, ..., Zp → X1, ..., Xk, Support_Filter requires that X1...XkZi be large 
for 1 ≤ i ≤ p. Therefore, it suffices to consider only the k-tuples of the form 
(X1, ..., Xk, Zi), where Xi's are in the lexicographical order and X1...XkZi 
makes a large event for all 1 ≤ i ≤ p. We can generate such k-tuples and UDAs 
of size k in a level-wise manner like Apriori by treating events as items: In the 
kth iteration, a k-tuple (X1, ..., Xk, Zi) is generated only if (X1, ..., Xk-1, Xk) 
and (X1, ..., Xk-1, Zi) were generated in the (k−1)th iteration and X1...XkZi 
is large. The largeness of X1...XkZi can be checked by looking up the 
hash-tree or hash table for storing large events. Also, the disjointness of Xk and Zi, 
required by Strength_Filter, and the lexicographical ordering of X1, ..., Xk, can 
be checked before generating tuple (X1, ..., Xk, Zi). After generating all k-tuples 
in the current iteration, we construct a candidate UDA Z1, ..., Zp → X1, ..., Xk 
using p distinct tuples of the form (X1, ..., Xk, Zi), i = 1, ..., p, that share the 
same prefix X1, ..., Xk. Any further syntax constraints on Zi, as discussed in 
Section 2.1, can be checked here. A candidate Z1, ..., Zp → X1, ..., Xk is a 
UDA if UDA(Z1, ..., Zp → X1, ..., Xk) holds. The above is repeated until some 
iteration k for which no k-tuple is generated. 
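
The level-wise construction can be sketched as below. This is a simplified variant written under several assumptions, not the authors' implementation: large events are given as a Python dict from item sets to supports instead of a hash-tree, each tuple is extended by one lexicographically later subject event rather than joined from two tuples of the previous level, p ≥ 1 is assumed, and the strength filter is passed in as a hypothetical predicate `uda_holds(contexts, subjects)`.

```python
from itertools import permutations


def lex_key(event):
    """Lexicographic key of an event (a frozenset of items)."""
    return tuple(sorted(event))


def mine_udas(large, p, k_max, uda_holds):
    """Return (contexts, subjects) pairs passing both filters, sizes 1..k_max."""
    subjects = sorted((e for e in large if e), key=lex_key)  # non-empty events
    contexts = sorted(large, key=lex_key)                    # may include ∅
    # Level 1: tuples (X1, Z) with X1 and Z disjoint and X1 Z large.
    level = [((x,), z) for x in subjects for z in contexts
             if not (x & z) and (x | z) in large]
    found = []
    for k in range(1, k_max + 1):
        if not level:
            break
        # Group tuples sharing the same subject prefix X1, ..., Xk.
        by_subjects = {}
        for xs, z in level:
            by_subjects.setdefault(xs, []).append(z)
        for xs, zs in by_subjects.items():
            for ctx in permutations(zs, p):   # context events are ordered
                if uda_holds(ctx, xs):
                    found.append((ctx, xs))
        if k == k_max:
            break
        # Extend each tuple by a later, disjoint subject event whose
        # conjunction with the tuple is still large.
        next_level = []
        for xs, z in level:
            covered = frozenset().union(*xs) | z
            for x in subjects:
                if (lex_key(x) > lex_key(xs[-1]) and not (x & covered)
                        and (covered | x) in large):
                    next_level.append((xs + (x,), z))
        level = next_level
    return found


# Hypothetical supports of all large events over items A, B, Z.
LARGE = {
    frozenset(): 1.0,
    frozenset("A"): 4 / 6, frozenset("B"): 4 / 6, frozenset("Z"): 4 / 6,
    frozenset("AB"): 3 / 6, frozenset("AZ"): 3 / 6, frozenset("BZ"): 3 / 6,
    frozenset("ABZ"): 3 / 6,
}


def cond_corr_at_least(threshold):
    """Strength filter in the style of Equation (1), using only supports."""
    def holds(ctx, xs):
        z = ctx[0]
        joint = LARGE[frozenset().union(*xs) | z] / LARGE[z]
        independent = 1.0
        for x in xs:
            independent *= LARGE[x | z] / LARGE[z]
        return len(xs) >= 2 and joint / independent >= threshold
    return holds


udas = mine_udas(LARGE, p=1, k_max=2, uda_holds=cond_corr_at_least(1.2))
```

Because every lookup goes through `LARGE`, the strength filter never needs a support that the first step did not already produce, which is the point of the ICA.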

For mining minimal UDAs, a straightforward algorithm is to first generate 
all UDAs and then remove non-minimal UDAs. A more efficient algorithm 
finds all minimal UDAs without generating non-minimal UDAs. The strategy 
is to consider subsets of {X1, ..., Xk} for subject events Xi's before considering 
{X1, ..., Xk} itself, and prune all supersets from consideration if any subset is 
found to be a UDA. Since this is essentially a modification of the above algorithm, 
we omit the detail in the interest of space. 

We have conducted several experiments to mine the classes of UDAs 
considered in Section 2.2 from the census data set used in [?], which contains 63 items 
and 126,229 transactions. The result is highly encouraging: it discovers several 
very interesting associations that cannot be found by existing approaches. For 
example, some strong causal associations were found among general k-item events, 
as discussed in Example I, but were not found in [?] because only 1-item events 
are considered there. This fact reinforces our claim that the uniform mining 
approach does not simply unify several existing approaches; it also extends 
beyond them by allowing the user to define new notions of association. We omit 
the detail of the experiments due to the space limit. 

5 Related Work 

In [?], a language for specifying several pre-determined rules is considered, but 
no mechanism is provided for the user to specify new notions of association. In 
[?], it is suggested to query mined rules through an application programming 
interface. In [?], some SQL-like languages are adopted for mining association 
rules. The expressiveness of these approaches is limited by the extended SQL. 
For example, they cannot specify most of the UDAs in Sections 1 and 2. In [?], 
a generic data mining task is defined as finding all patterns from a given pattern 
class that satisfy some interestingness filters, but no concrete language is 
proposed for pattern classes and interestingness filters. Finding causal relationships 
is studied in [?]. None of these works considers the extendibility where the 
user can define a new mining task. 

6 Conclusion 

This paper introduces the notion of user-defined association mining, i.e., the 
UDA mining, and proposes a specification framework and implementation. The 
purpose is to move towards a unified data mining where the user can mine a 
database with the same ease as querying a database. For the proposed approach 
to work in practice, however, further studies are needed in several areas. Our 
current work has considered only limited syntax constraints on events, and it is 
important to exploit broader classes of syntax constraints to reduce the search 
space. Also, a unified mining algorithm may be inferior to specialized algorithms 
targeted at specific classes of UDAs. It is important to study various optimization 
strategies for typical and expensive building blocks of UDAs. In this paper, we 
have mainly focused on the semantics and “computability” (in no theoretic sense) 
of the specification language. A user specification interface, especially merged 
with SQL, is an interesting topic. 

Abstract. In the paper we consider the knowledge in the form of association 
rules. The consequents derivable from the given set of association rules 
constitute the theory for this rule set. We apply maximal covering rules as a 
concise representation of the theory. We prove that maximal covering rules 
have precisely computable values of support and confidence, though the theory 
can contain rules for which these values can be only estimated. Efficient 
methods of direct and incremental computation of maximal covering rules are 
offered. 



1 Introduction 

The problem of discovering strong association rules was introduced in [1] for sales 
transaction databases. The association rules identify sets of items that are purchased 
together with other sets of items. In the paper we consider a specific problem of 
mining around rules rather than mining of rules in a database. Let us assume a user is 
not allowed to access the database and can deal only with the restricted number of 
rules provided by somebody else. Still, the user hopes to find new interesting 
relationships. On the other hand, the rules provider should be certain no secret 
patterns will be discovered by the user. Therefore, it is important for the provider to 
be aware of the consequents derivable from the delivered rule set. 

The problem of inducing knowledge from the rule set was addressed first in [5]. 
We showed there how to use the cover operator and extension operator in order to 
augment the given knowledge. Unlike the extension operator, the cover operator does 
not require any information on statistical importance (support) of rules. The induced 
rules are of the same or higher quality than the original ones. Additionally, the 
notion of maximal covering rules, which represent (subsume) the set of given and 
induced rules, was introduced in [5]. 

It was shown in [6] how to induce all knowledge (theory) derivable from the given 
rule set. Theory can contain more rules than those derived by means of cover and 
extension operators. The algorithms for inducing theory as well as for deriving 
maximal covering rules for theory were offered there. Additionally, it was shown how 
to test the consistency of the provided rule set and how to extract its consistent subset. 

In this paper we investigate properties of maximal covering rules in order to 
propose a more efficient method of computing them than the one proposed in [6]. In 
addition to an efficient direct algorithm for computing maximal covering rules we 
propose an incremental approach. 

2 Association Rules, Rule Cover, and Maximal Covering Rules 

Let I be a set of items. In general, any set of items is called an itemset. The itemset 
consisting of k items is called a k-itemset. Let D be a set of transactions, where each 
transaction T is a subset of I. An association rule is an implication X ⇒ Y, where 
∅ ≠ X, Y ⊂ I and X ∩ Y = ∅. Support of an itemset X is denoted by sup(X) and defined 
as the number (or the percentage) of transactions in D that contain X. Support of the 
association rule X ⇒ Y is denoted by sup(X ⇒ Y) and defined as sup(X ∪ Y). 
Confidence of X ⇒ Y is denoted by conf(X ⇒ Y) and defined as sup(X ∪ Y) / sup(X). 
The problem of mining association rules is to generate all rules that have sufficient 
support and confidence. In the sequel, the set of all association rules whose support is 
greater than s and confidence is greater than c will be denoted by AR(s,c). 

A notion of a cover operator was introduced in [3] for deriving a set of rules from
a given association rule without accessing a database. The cover C of the rule X => Y
was defined as follows: C(X => Y) = { X∪Z => V | Z,V ⊆ Y ∧ Z∩V = ∅ ∧ V ≠ ∅ }.

Property 1 [3]. Let r: (X => Y) and r': (X' => Y') be association rules.

r' ∈ C(r) iff X'∪Y' ⊆ X∪Y ∧ X' ⊇ X.

Property 2 [4]. Let r': (X' => Y') belong to AR(s,c).

∃r ∈ AR(s,c), r ≠ r' ∧ r' ∈ C(r) iff ∃(X => Y) ∈ AR(s,c), (X' = X ∧ X'∪Y' ⊂ X∪Y) ∨
(X' ⊃ X ∧ X'∪Y' = X∪Y).

Property 3 [3]. Let r,r' ∈ AR(s,c). If r' ∈ C(r), then sup(r') ≥ sup(r) ∧ conf(r') ≥ conf(r).

Example 1. Let T1 = {A,B,C,D,E}, T2 = {A,B,C,D,E,F}, T3 = {A,B,C,D,E,H,I}, T4 =
{A,B,E} and T5 = {B,C,D,E,H,I} be the only transactions in the database D. Let
r: (B => DE). Then, C(r) = {B => DE (4,80%), B => D (4,80%), B => E (5,100%),
BD => E (4,100%), BE => D (4,80%)}. Clearly, the support and confidence of rules in
C(r) are not less than the support and confidence of r.
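
The cover of Example 1 can be reproduced with a few lines of code. The following is a minimal Python sketch, not the authors' implementation; the function names sup, conf and cover are chosen here for illustration, and supports are absolute transaction counts.

```python
from itertools import combinations

# Transactions T1..T5 of Example 1.
D = [
    set("ABCDE"),
    set("ABCDEF"),
    set("ABCDEHI"),
    set("ABE"),
    set("BCDEHI"),
]

def sup(itemset):
    """Number of transactions in D that contain every item of itemset."""
    return sum(1 for t in D if set(itemset) <= t)

def conf(x, y):
    """Confidence of X => Y, i.e. sup(X u Y) / sup(X)."""
    return sup(set(x) | set(y)) / sup(x)

def cover(x, y):
    """All rules (X u Z) => V with Z, V subsets of Y, Z and V disjoint, V non-empty."""
    x, y = frozenset(x), frozenset(y)
    rules = set()
    for n in range(1, len(y) + 1):
        for v in combinations(sorted(y), n):
            rest = y - set(v)
            for m in range(len(rest) + 1):
                for z in combinations(sorted(rest), m):
                    rules.add((x | frozenset(z), frozenset(v)))
    return rules

# Print each rule of C(B => DE) with its support count and confidence.
for a, c in sorted(cover("B", "DE"), key=lambda r: (sorted(r[0]), sorted(r[1]))):
    print("".join(sorted(a)), "=>", "".join(sorted(c)), sup(a | c), f"{conf(a, c):.0%}")
```

Every rule printed has support and confidence at least those of B => DE, in line with Property 3.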

Maximal covering rules (MCR) for the set of rules R were defined in [5] as
follows: MCR(R) = { r ∈ R | ¬∃ r' ∈ R, r' ≠ r ∧ r ∈ C(r') }. Whatever can be induced from R
by the cover operator will also be induced from its subset MCR(R).

Example 2. Let R = {B => DE, B => D, B => E, BD => E, BE => D, B => CDE,
B => CD, B => CE, BC => DE} be the set of association rules. This set of rules can be
derived from just one maximal covering rule, namely: MCR(R) = {B => CDE}.
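
Example 2 can be checked mechanically by combining the definition of MCR(R) with the membership test of Property 1. The sketch below is illustrative only; in_cover, mcr and the rule parser are names assumed here, not taken from [5].

```python
def in_cover(r_prime, r):
    """Property 1: r' is in C(r) iff X' u Y' is a subset of X u Y and X' contains X."""
    (x1, y1), (x, y) = r_prime, r
    return (x1 | y1) <= (x | y) and x1 >= x

def mcr(rules):
    """Rules of R that are not covered by any other rule of R."""
    return {r for r in rules
            if not any(r2 != r and in_cover(r, r2) for r2 in rules)}

def rule(text):
    """Parse a string such as 'BC=>DE' into (antecedent, consequent) frozensets."""
    a, c = text.split("=>")
    return frozenset(a), frozenset(c)

R = {rule(t) for t in ["B=>DE", "B=>D", "B=>E", "BD=>E", "BE=>D",
                       "B=>CDE", "B=>CD", "B=>CE", "BC=>DE"]}
print(mcr(R))
```

The test is quadratic in the number of rules, which is acceptable for a small illustration but is not the approach the paper proposes for computing MCR of a theory.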



3 Inducing Theory 

In this section we recall, after [6], how to induce the knowledge derivable from R.
In order to augment the initial knowledge R, we use the information on the
supports of itemsets that is available in R. In the sequel, we assume the supports
and confidences of all rules in R are known. By this assumption, the supports of
the itemsets of these rules as well as the supports of the antecedents of these
rules are also known. The support of the antecedent of a rule r ∈ R is equal to
sup(r) / conf(r). Applying this simple observation, the notion of known itemsets for R
(KIS(R)) was defined in [6] as follows: KIS(R) = {X∪Y | X => Y ∈ R} ∪ {X | X => Y ∈ R}.

One can easily note that for any itemset X such that there are Y,Z ∈ KIS(R), Y ⊆ X ⊆ Z,
the support of X can be estimated as follows: sup(Y) ≥ sup(X) ≥ sup(Z). The itemsets
whose support can be assessed by employing the knowledge on R are called derivable itemsets
for R (DIS(R)) and are defined as follows: DIS(R) = {X | ∃Y,Z ∈ KIS(R), Y ⊆ X ⊆ Z}.
Obviously, DIS(R) ⊇ KIS(R). Pessimistic support (pSup) and optimistic support
(oSup) of an itemset X ∈ DIS(R) wrt. R are defined as follows: pSup(X,R) =
max{sup(Z) | Z ∈ KIS(R) ∧ X ⊆ Z}, oSup(X,R) = min{sup(Y) | Y ∈ KIS(R) ∧ Y ⊆ X}.

Clearly, the real support of X ∈ DIS(R) belongs to [pSup(X,R), oSup(X,R)]. In addition,
sup(X) = pSup(X,R) = oSup(X,R) for X ∈ KIS(R).

Knowing DIS(R) one can induce (approximate) rules X => Y provided X∪Y ∈ DIS(R)
and X ∈ DIS(R). The pessimistic confidence (pConf) of induced rules is defined as
follows: pConf(X => Y,R) = pSup(X∪Y,R) / oSup(X,R). The knowledge derivable from
R is called theory for R (T(R)) and defined as follows: T(R) = {X => Y | X∪Y ∈ DIS(R)
∧ X ∈ DIS(R)}. It is guaranteed for every rule r ∈ T(R) that its support is not lower
than pSup(r,R) and its confidence is not lower than pConf(r,R). By T(R,s,c) we denote
T(R) ∩ AR(s,c). In particular, T(R,0,0) equals T(R).
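
The definitions of KIS(R), pSup, oSup and pConf translate directly into code. The following Python sketch is an illustration under stated assumptions: rules are given as (antecedent, consequent, support count, confidence) tuples, confidences are exact fractions, and all function names are hypothetical. The two sample rules use the supports of the database of Example 1.

```python
from fractions import Fraction

def known_itemsets(rules):
    """KIS(R) as a dict mapping each known itemset to its support."""
    kis = {}
    for x, y, s, c in rules:
        kis[frozenset(x) | frozenset(y)] = s   # sup(X u Y) = sup(rule)
        kis[frozenset(x)] = s / c              # sup(X) = sup(rule) / conf(rule)
    return kis

def p_sup(x, kis):
    """Pessimistic support: largest support among known supersets of X."""
    x = frozenset(x)
    return max((s for z, s in kis.items() if x <= z), default=0)

def o_sup(x, kis):
    """Optimistic support: smallest support among known subsets of X."""
    x = frozenset(x)
    return min((s for y, s in kis.items() if y <= x), default=float("inf"))

def p_conf(x, y, kis):
    """Pessimistic confidence of X => Y; X and X u Y are assumed to be in DIS(R)."""
    return p_sup(frozenset(x) | frozenset(y), kis) / o_sup(x, kis)

# Two rules with supports taken from the database of Example 1.
R = [("B", "CDE", 4, Fraction(4, 5)), ("BE", "D", 4, Fraction(4, 5))]
kis = known_itemsets(R)
print(p_sup("BD", kis), o_sup("BD", kis), p_conf("BD", "E", kis))
```

Here BD is derivable but not known: its real support in Example 1 is 4, which lies in the interval [pSup, oSup] = [4, 5], and the real confidence of BD => E (100%) is not lower than the pessimistic estimate.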



4 Direct Generation of Maximal Covering Rules for Theory 

In this section we consider generation of maximal covering rules for the theory
T(R,s,c), where R is a rule set, s is a minimum rule support, and c is a minimum rule
confidence. Let us start with the property of rules generated from DIS(R):

Property 4. Let X,Z ∈ DIS(R), X',Z' ∈ KIS(R), pSup(Z,R) = sup(Z')
and oSup(X,R) = sup(X').

a) pSup(X => Z\X,R) = pSup(X' => Z'\X',R) = sup(Z'),

b) pConf(X => Z\X,R) = pConf(X' => Z'\X',R) = conf(X' => Z'\X'),

c) X => Z\X ∈ T(R,s,c) iff X' => Z'\X' ∈ T(R,s,c),

d) X => Z\X ∈ C(X' => Z'\X').

Proof: Ad. b) pConf(X => Z\X,R) = pSup(X => Z\X,R) / oSup(X,R) = /* by Property 4a */
= pSup(X' => Z'\X',R) / sup(X') = pSup(X' => Z'\X',R) / oSup(X',R)
= pConf(X' => Z'\X',R) = sup(X' => Z'\X') / sup(X') = conf(X' => Z'\X').

Observations: 

O1. It follows from the definition of DIS(R) and pSup that for every Z in DIS(R) there
is Z', Z' ⊇ Z, in KIS(R) such that pSup(Z,R) = sup(Z'). Similarly, by the definition of
DIS(R) and oSup, for every X in DIS(R) there is X', X' ⊆ X, in KIS(R) such that
oSup(X,R) = sup(X'). Hence, and from Property 4, it follows that for every rule in
T(R,s,c) built from a derivable itemset with the antecedent being a derivable
itemset there is a covering rule in T(R,s,c) built from a known itemset with the
antecedent being a known itemset. This implies that no rule built from an itemset
in DIS(R)\KIS(R) or having an antecedent built from an itemset in DIS(R)\KIS(R) is
a maximal covering rule. Thus, generation of candidate MCR can be restricted to
generating rules from itemsets in KIS(R) whose antecedents are also built from




itemsets in KIS(R). Since a maximal covering rule and its antecedent are both built
from known itemsets, its support and confidence are also known.

O2. Property 4d implies that no rule X => Z\X built from Z ∈ KIS(R) is maximal
covering if there is a proper superset Z' ∈ KIS(R) of Z having the same pessimistic
support as Z.

Observations O1 and O2 are used in the FastGenMCR algorithm that computes
MCR(T(R,s,c)). Our algorithm is a modification of FastGenMaxCoveringRules, which we
proposed in [6]. The difference between the two algorithms is as follows:
FastGenMaxCoveringRules builds candidate rules X => Z\X assuming Z ∈ KIS(R) and
X ∈ DIS(R). FastGenMCR generates candidate rules X => Z\X assuming that not only Z but
also X belongs to KIS(R). Both algorithms return the same MCR.

Algorithm. FastGenMCR(known itemsets with support > s: KIS, min. conf.: c);
{ MCR = ∅;
  forall k-itemsets Z ∈ KIS, k ≥ 2, do {
    Z.maxSup = max({Z'.sup | Z ⊂ Z' ∈ KIS} ∪ {0});
    if Z.sup ≠ Z.maxSup then {                 // Observation O2
      A_1 = {{x} | x ∈ Z};                     // create 1-item antecedents
      for (i = 1; (A_i ≠ ∅) and (i < k); i++) do {
        forall itemsets X ∈ A_i ∩ KIS do {
          conf = Z.sup / X.sup;
          if conf > c then {
            /* X => Z\X is an association rule */
            if (Z.maxSup / X.sup ≤ c) then
              /* There is no longer assoc. rule X => Z'\X, Z' ⊃ Z, that covers X => Z\X */
              add X => Z\X to MCR;             // Property 2 & Observation O1
            /* Antecedents of association rules are not extended */
            A_i = A_i \ {X}; }; };
        A_{i+1} = AprioriGen(A_i); }; }; };    // compute (i+1)-item antecedents (see [2])
  return (MCR); }
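
The control flow of FastGenMCR can be illustrated with a simplified Python sketch. It is not the authors' implementation: instead of generating antecedents level by level with AprioriGen, it scans the known proper subsets of each generator in order of size and skips any antecedent that extends one which already produced an association rule, which is intended to have the same effect. The function name and the dict-based representation of KIS are assumptions.

```python
def fast_gen_mcr(kis, c):
    """kis: dict mapping frozenset itemsets to supports; c: minimum confidence.

    Returns a set of (antecedent, consequent) pairs.
    """
    mcr = set()
    for z, z_sup in kis.items():
        if len(z) < 2:
            continue
        # Largest support among known proper supersets of Z (0 if none).
        max_sup = max((s for z2, s in kis.items() if z < z2), default=0)
        if z_sup == max_sup:
            continue                  # Observation O2: Z cannot generate an MCR
        # Known proper subsets of Z, shortest first.
        antecedents = sorted((x for x in kis if x and x < z), key=len)
        closed = []                   # antecedents that already gave a rule
        for x in antecedents:
            if any(a < x for a in closed):
                continue              # antecedents of rules are not extended
            if z_sup / kis[x] > c:    # X => Z minus X is an association rule
                closed.append(x)
                if max_sup / kis[x] <= c:
                    mcr.add((x, z - x))
    return mcr

# Known itemsets with supports taken from the database of Example 1.
kis = {frozenset("B"): 5, frozenset("BC"): 4,
       frozenset("BDE"): 4, frozenset("BCDE"): 4}
print(fast_gen_mcr(kis, 0.5))
```

With these known itemsets and c = 0.5, only BCDE passes the test of Observation O2, and B is its shortest antecedent yielding a rule, so the single rule B => CDE is returned.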

5 Incremental Generation of Maximal Covering Rules for Theory 

In this section we consider the issue of updating maximal covering rules for the
theory when additional rules are provided. Let MCR be the maximal covering rules
for T(R,s,c), where R is an initial rule set, s is a minimum required rule support, and c
is a minimum required rule confidence. Let r: X => Z\X be the provided rule whose
support and confidence are known. Then, the set of known itemsets is augmented by
itemsets X and Z, which can be used for the construction of new association rules. It
may happen that some new association rules will be maximal covering. On the other
hand, the new association rules may invalidate some of the maximal covering rules that
were previously found. Hence, the update of maximal covering rules consists of
two steps: 1) generation of new maximal covering rules and 2) validation of old ones.
The IncrGenMCR algorithm we propose shows how to update the maximal covering rules
MCR for theory T(R,s,c) for each new known itemset with sufficiently high support.

Algorithm. IncrGenMCR(var maximal covering rules for T(R,s,c): MCR,
  var known itemsets with support > s: KIS, new known itemset: X, min. conf.: c);
{ if X ∉ KIS then {
    /* Step 1: add new maximal covering rules */
    ΔMCR = AddNewMCR(KIS, X, c);
    /* Step 2: remove invalidated maximal covering rules */
    RemoveInvalidatedMCR(MCR, X, c);
    add ΔMCR to MCR;
    add X to KIS; } }
5.1 Adding New Maximal Covering Rules

Algorithm. AddNewMCR(known itemsets: KIS, new known itemset: X, min. conf.: c);
{ ΔMCR = ∅;
  SuperKIS = {proper supersets of X in KIS};
  SubKIS = {proper subsets of X in KIS};
  X.maxSup = max({Z'.sup | Z' ∈ SuperKIS} ∪ {0});
  X.minSup = min({Z'.sup | Z' ∈ SubKIS} ∪ {∞});
  /* new itemset X will be used as an antecedent of candidate rules */
  /* KIS will be used as generators of candidate rules */
  forall itemsets Z ∈ SuperKIS do {
    /* update minSup of superset Z of X for future use */
    Z.minSup = min(Z.minSup, X.sup);
    if Z.sup ≠ Z.maxSup then                   // Observation O2
      if IsMCR(X => Z\X, c) = true then
        add X => Z\X to ΔMCR; }
  /* X will be used as a generator of candidate rules */
  /* KIS will be used as antecedents of candidate rules */
  if X.sup ≠ X.maxSup then                     // Observation O2
    forall itemsets Z ∈ SubKIS do {
      /* update maxSup of subset Z of X for future use */
      Z.maxSup = max(Z.maxSup, X.sup);
      if IsMCR(Z => X\Z, c) = true then
        add Z => X\Z to ΔMCR; }
  return (ΔMCR); }

Let X be a new known itemset. The AddNewMCR algorithm works under the
assumption that for every known itemset Z additional information on
maxSup and minSup is kept, where maxSup is the maximum of the supports of known
itemsets that are proper supersets of Z and minSup is the minimum of the supports
of known itemsets that are proper subsets of Z. At first, AddNewMCR determines
subsets and supersets of X and computes maxSup and minSup for X. Next, each proper
superset Z of X is considered as a generator of the candidate rule X => Z\X.
(According to Observation O2 it makes sense to limit generators to those satisfying
the condition Z.sup ≠ Z.maxSup.) Similarly, each proper subset Z of X is considered as
an antecedent of the candidate rule Z => X\Z. (According to Observation O2 it makes
sense to consider such candidates if X.sup ≠ X.maxSup.) The IsMCR function
validates candidate rules as being maximal covering in T(R,s,c) or not.

function IsMCR(candidate rule: X => Z\X, min. conf.: c);
{ conf = Z.sup / X.sup;
  if conf > c then
    /* X => Z\X is an association rule */
    if (Z.maxSup / X.sup ≤ c)
      /* There is no longer assoc. rule X => Z'\X, Z' ⊃ Z, that covers X => Z\X */
      and (Z.sup / X.minSup ≤ c) then
      /* There is no assoc. rule X' => Z\X' with antecedent X' ⊂ X that covers X => Z\X */
      return (true);
      /* X => Z\X is MCR by Property 2 & Observation O1 */
  return (false); }

Let X => Z\X be a candidate rule. According to Property 2, the rule is not covered if
there is no association rule of the form X => Z'\X, Z' ⊃ Z, and there is no association
rule of the form X' => Z\X', X ⊃ X'. Clearly, if Z.maxSup / X.sup ≤ c then there is no
association rule of the form X => Z'\X, Z' ⊃ Z, and if Z.sup / X.minSup ≤ c then there is
no association rule of the form X' => Z\X', X ⊃ X'. In such a case, the candidate rule
X => Z\X is not covered by any association rule and hence it is maximal covering.
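
The IsMCR test relies only on the bookkeeping fields kept with each known itemset. A minimal Python sketch follows; the Itemset record and its field names are assumptions made for illustration, and the comparisons follow the convention that a rule is an association rule only if its confidence is strictly greater than c.

```python
from dataclasses import dataclass

@dataclass
class Itemset:
    items: frozenset
    sup: float
    max_sup: float = 0.0           # max support of known proper supersets
    min_sup: float = float("inf")  # min support of known proper subsets

def is_mcr(x, z, c):
    """True if the rule X => (Z minus X) is maximal covering for min. confidence c."""
    if z.sup / x.sup <= c:
        return False               # not an association rule at all
    # No association rule with the same antecedent and a longer generator.
    no_longer_rule = z.max_sup / x.sup <= c
    # No association rule with the same generator and a shorter antecedent.
    no_shorter_antecedent = z.sup / x.min_sup <= c
    return no_longer_rule and no_shorter_antecedent

b = Itemset(frozenset("B"), sup=5)
bcde = Itemset(frozenset("BCDE"), sup=4)
bde = Itemset(frozenset("BDE"), sup=4, max_sup=4)
bc = Itemset(frozenset("BC"), sup=4, min_sup=5)
print(is_mcr(b, bcde, 0.5), is_mcr(b, bde, 0.5), is_mcr(bc, bcde, 0.5))
```

In this example B => CDE is maximal covering, B => DE is covered by a rule with a longer generator, and BC => DE is covered by a rule with a shorter antecedent.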
5.2 Removing Invalidated Maximal Covering Rules

Algorithm. RemoveInvalidatedMCR(var maximal covering rules: MCR,
  new known itemset: X, min. conf.: c);
{ forall rules Y => Z\Y ∈ MCR, where the rule generator Z ⊂ X do {
    if (X.sup / Y.sup > c) then
      /* Y => X\Y is an association rule and covers Y => Z\Y - see Property 2 */
      remove Y => Z\Y from MCR; }
  forall rules Y => Z\Y ∈ MCR, where the rule antecedent Y ⊃ X do {
    if (Z.sup / X.sup > c) then
      /* X => Z\X is an association rule and covers Y => Z\Y - see Property 2 */
      remove Y => Z\Y from MCR; } }

The RemoveInvalidatedMCR algorithm applies Property 2 in order to eliminate all
rules from MCR covered by association rules which can be built from X.
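
The two passes of RemoveInvalidatedMCR can be sketched in Python as a single loop. This is an illustration under assumptions: each rule is stored as an (antecedent, generator) pair of frozensets, and sup is a hypothetical dict holding the supports of all known itemsets, including the new one.

```python
def remove_invalidated_mcr(mcr, sup, x, c):
    """Drop rules of mcr that are covered by association rules built from x.

    mcr: set of (Y, Z) pairs, each standing for the rule Y => (Z minus Y);
    sup: dict mapping known itemsets, including x, to their supports.
    """
    for y, z in list(mcr):
        # Y => (X minus Y) has a longer generator and covers the rule.
        longer_generator = z < x and sup[x] / sup[y] > c
        # X => (Z minus X) has a shorter antecedent and covers the rule.
        shorter_antecedent = y > x and sup[z] / sup[x] > c
        if longer_generator or shorter_antecedent:
            mcr.discard((y, z))
    return mcr

B, BC = frozenset("B"), frozenset("BC")
BDE, BCDE = frozenset("BDE"), frozenset("BCDE")
sup = {B: 5, BC: 4, BDE: 4, BCDE: 4}
print(remove_invalidated_mcr({(B, BDE)}, sup, BCDE, 0.5))
```

When BCDE becomes known, the rule B => DE stops being maximal covering because B => CDE is an association rule that covers it.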



5.3 Computational Complexity 

The two steps of updating maximal covering rules, generating new maximal covering
rules (AddNewMCR) and validating old ones (RemoveInvalidatedMCR), can be
performed independently. The former is linear wrt. the number of known itemsets
KIS and the latter is linear wrt. the number of maximal covering rules for T(R,s,c).

6 Conclusions 



In this paper we proved an important property of maximal covering rules for the
theory: both generators and antecedents of maximal covering rules can
be built only from known itemsets. Hence, the support and confidence of
maximal covering rules can be precisely computed. In order to find maximal covering
rules for the theory we proposed FastGenMCR. In addition, we proposed an incremental
approach to computing maximal covering rules when information on new
rules/itemsets is provided. The offered IncrGenMCR algorithm is linear wrt. the
number of known itemsets and the number of maximal covering rules.
Abstract. The problem that we tackle here is a practical one: when users
interactively mine association rules, it is often the case that they have to
continuously tune two thresholds, minimum support and minimum confidence, which
describe the users’ changing requirements. In this paper, we present an efficient
data re-mining (DRM) technique for updating previously discovered association
rules in light of threshold changes.



1 Introduction 

Various algorithms have been proposed [1,2,4, 6] to discover frequent item-sets. Gen- 
erally speaking, these algorithms first construct a candidate set of frequent item-sets 
based on certain heuristics, and then discover the subset that indeed contains frequent 
item-sets. This process can be done iteratively in the sense that the frequent item-sets 
discovered at one iteration will be used as the basis for generating the candidate set for 
the next iteration. For example, in [2], at the kth iteration, all frequent item-sets
containing k items, referred to as frequent k-item-sets, are generated. In the next iteration,
to construct a candidate set of frequent (k+1)-item-sets, a heuristic is used to expand
some frequent k-item-sets into a (k+1)-item-set, if certain constraints are satisfied.

Among all the algorithms proposed, Apriori (and its variants) [2] and DHP [3] al- 
gorithms are most commonly applied. They both run a number of iterations and com- 
pute the frequent item-sets of the same size at each iteration, starting from the size-one 
item-sets. At each iteration, they first construct a set of candidate item-sets and then 
scan the database to count the number of transactions that contain each candidate set. 
The key to optimization lies in the techniques used to create the candidate sets. The 
smaller the number of candidate sets is, the faster the algorithms would be. 

However, very little work has been done on the second problem mentioned earlier. 
A method of handling incremental database updates for the rules discovered by the 
generalization-based approach was briefly discussed in [5]. As related to this problem, 
Lee and Cheung have done some work [7], which focuses primarily on how to update 
association rules when a database is incrementally changed. In real-world
applications, users are often unsure about their requirements on the minimum support and
confidence in the first place. This can be due to a lack of knowledge about the
application domains or the outcomes resulting from different threshold settings. As a result,
they may be repeatedly unsatisfied with the association rules discovered, and hence
need to re-execute the mining procedure many times with varied thresholds. In
cases where large databases are involved, this can be a time-consuming,
trial-and-error process, since all the computation done initially in finding the old frequent
item-sets is wasted and all frequent item-sets have to be recomputed from scratch. In
order to deal with this situation, it is both desirable and imperative to develop an
efficient means for re-mining a database under different thresholds in order to obtain
an acceptable set of association rules. In this paper, we explicitly address this
problem and present an efficient algorithm, called Posteriori, for computing the
frequent item-sets under the varied thresholds.



2 Problem Statement 

2.1 Mining Association Rules 

Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where
each transaction T is a set of items such that T ⊆ I. Associated with each transaction is
a unique identifier, called its TID. We say that a transaction T contains X, a set of
some items in I, if X ⊆ T. An association rule is an implication of the form X => Y,
where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X => Y holds in the transaction set D with
confidence c if c% of transactions in D that contain X also contain Y. The rule X => Y
has support s in the transaction set D if s% of transactions in D contain X ∪ Y.

2.2 Re-mining Association Rules 

Let L be the set of frequent item-sets in D, s be the minimum support, and |D| be the
number of transactions in D. Assume that for each X ∈ L, its support count, X.support,
which is the number of transactions in D containing X, is available.

After users have found some association rules, they may be unsatisfied with the
results and want to try out new results with certain changes on thresholds, such as changing
minsup from s to s'.

Thus, the essence of the problem of re-mining association rules is to find the set L'
of frequent item-sets under the new thresholds. Note that a frequent item-set in L may
not be a frequent item-set in L'. On the other hand, an item-set X not in L may become
a frequent item-set in L'.



3 Algorithm Posteriori 



The following notations are used in the rest of the paper. L_k is the set of all frequent
k-item-sets in D under the support of s%, and L_k' is the set of all frequent k-item-sets
in D under the support of s'%. C_k is the set of size-k candidate sets in the k-th
iteration of Posteriori. Moreover, X.support represents the support count of an item-set X
in D.

When minsup is changed, two cases may happen:

1. s'>s; in this case, some original frequent item-sets will become losers, i.e.,
they are no longer frequent under the new threshold minsup.

2. s'<s; in this case, some item-sets that were not frequent originally will become winners, i.e.,
they are included in the frequent item-sets L' under the new threshold minsup.

In the first case, the updating of frequent item-sets is simple and intuitive. Using the
item-set support, the algorithm can be stated as follows:

Algorithm Posteriori_A:

Input: (1) L_k: the set of all frequent k-item-sets in D, where k=1,...,r
       (2) s': where s'>s

Output: L': the set of all frequent item-sets in D under s'.

L' = ∅                                    /* L': initialized */
for (k=1; k≤r; k++) do begin
  L_k' = {X ∈ L_k | X.support ≥ s'}       /* put winners in L_k' */
  L' = L' ∪ L_k'
end
return L'



The correctness of algorithm Posteriori_A can be guaranteed by the following
lemma:

Lemma 1: A k-item-set X is in the frequent item-set L_k' of database D under s'
only if X is in the frequent item-set L_k under s, where s'>s.

Proof. Suppose that X is not in the frequent item-set L_k; then X.support < s×|D|.
Since s'>s, X.support < s'×|D|. That is, X is not in the frequent item-set L_k'.
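
Because of Lemma 1, Posteriori_A never needs to touch the database: it only filters the stored frequent item-sets. A minimal Python sketch, assuming the stored result is a dict from item-set size k to a dict of item-sets with their support counts, and that the new threshold is given as a transaction count (both are assumptions made for illustration):

```python
def posteriori_a(frequent, new_minsup_count):
    """Keep only the item-sets that remain frequent under a higher threshold.

    frequent: dict k -> {itemset: support count} found under the old threshold;
    new_minsup_count: new minimum support, expressed as a transaction count.
    """
    return {
        k: {x: n for x, n in level.items() if n >= new_minsup_count}
        for k, level in frequent.items()
    }

frequent = {
    1: {frozenset("A"): 5, frozenset("B"): 3},
    2: {frozenset("AB"): 3},
}
print(posteriori_a(frequent, 4))
```

Raising the threshold from 3 to 4 transactions keeps only {A}; no candidate generation or counting is required.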

Now, let us concentrate on the second case mentioned above. We will propose an 
efficient algorithm, called Posteriori_B. The framework of Posteriori_B is 
similar to that of Apriori. It contains a number of iterations. The iteration starts from 
the size-one item-sets, and at each iteration, all the frequent item-sets of the same size 
are found. Moreover, the candidate sets at each iteration are generated based on the 
frequent item-sets found at the previous iteration. The features of Posteriori_B 
that distinguish it from Apriori are listed as follows: 

1. At each iteration, the size-k frequent item-sets in L are updated against the
change of the support threshold to add the winners.

2. While generating the candidates, a set of candidate sets, C_k, is divided into
three parts. Each of them is generated separately.

These features combined together form the core in the design of Posteriori_B and
make Posteriori a much faster algorithm in comparison with the re-running of Apriori
on the database.

The following is a detailed description of the algorithm Posteriori_B. The 
first iteration of Posteriori_B is described, which is followed by the discussion of 
the remaining iterations. 
3.1 First Iteration: Finding Size-One Winners and Generating Candidate Sets

The following properties are useful in the derivation of the frequent 1-item-sets for the
updated s'.

Lemma 2: A k-item-set X in the frequent item-set L_k of database D under s is also
in the frequent item-set L_k' under s', where s'<s.

Proof. Since X is in the frequent item-set L_k, X.support ≥ s×|D|, so X.support ≥ s'×|D|.
That is, X is in the frequent item-set L_k'.

Based on Lemma 2, finding the frequent 1-item-sets L_1' only requires scanning the
item-sets that are not in L_1. If we store the result C_1 of algorithm Apriori, we need only scan
C_1 − L_1.

So, the first iteration is very simple: scan C_1 − L_1, checking the condition
X.support ≥ s'×|D|. Once all new winners are found, we call them l_1; thus L_1' = L_1 ∪ l_1.
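
The first iteration amounts to one pass over the stored candidate 1-item-sets that were not frequent before. A small Python sketch, assuming the counts of all candidates in C_1 were kept from the original Apriori run and that the threshold is expressed as a transaction count (the function and parameter names are hypothetical):

```python
def first_iteration(c1, l1, new_minsup_count):
    """Find the new size-one winners and the updated frequent 1-item-sets.

    c1: dict mapping every candidate 1-item-set to its stored support count;
    l1: set of 1-item-sets that were frequent under the old threshold.
    Returns (new winners, updated frequent 1-item-sets).
    """
    new_winners = {x for x, n in c1.items()
                   if x not in l1 and n >= new_minsup_count}
    return new_winners, l1 | new_winners

c1 = {frozenset("A"): 5, frozenset("B"): 3, frozenset("C"): 1}
l1 = {frozenset("A")}
print(first_iteration(c1, l1, 2))
```

By Lemma 2 the old frequent item-sets are kept without any check; only the remaining candidates are compared with the new threshold.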

3.2 Second Iteration and Beyond: Dividing Candidate Sets and Finding
Remaining Winners

The following properties are useful in the derivation of the frequent k-item-sets (where
k>1) for the updated s'.

Lemma 3: All the subsets of a frequent item-set must also be frequent. 

Proof. This is the basic property of frequent item-sets, as proven in [2]. 

Lemma 4: A k-item-set {X_1, X_2, ..., X_k} not in the original frequent k-item-set L_k
can become a winner under the updated s', where s'<s, only if {X_1, X_2, ..., X_k}.support
≥ s'×|D|.

Proof. The lemma is derivable from the definitions of minimum support and
frequent k-item-set.

At the first iteration, as shown above, all the frequent 1-item-sets are divided into two
non-intersecting sets: L_1 and l_1. Based on Lemma 3, all the subsets of a frequent
item-set must also be frequent, so any 1-item-set that corresponds to a single item of
a frequent k-item-set must be an element of L_1 or l_1. Accordingly, we can divide all
frequent k-item-sets under the new minsup s' into three classes:

1: frequent k-item-sets {X_1, X_2, ..., X_k} such that ∀i (1≤i≤k), {X_i} ∈ L_1;

2: frequent k-item-sets {X_1, X_2, ..., X_k} such that ∀i (1≤i≤k), {X_i} ∈ l_1;

3: frequent k-item-sets {X_1, X_2, ..., X_k} that split into two non-empty
subsets x_1 and x_2, x_1 ∪ x_2 = {X_1, X_2, ..., X_k}, x_1 ∩ x_2 = ∅, where the items of
x_1 come from L_1 and the items of x_2 come from l_1.

So, we have obtained three mutually disjoint subsets, and we call them L_k[1],
L_k[2], and L_k[3], respectively. Moreover, we can filter the original L_k from L_k[1], and call
the remaining set l_k; then we have L_k[1] = L_k ∪ l_k, and also L_k' = L_k[1] ∪ L_k[2] ∪ L_k[3],
L_k[1] ∩ L_k[2] = ∅, L_k[1] ∩ L_k[3] = ∅, and L_k[2] ∩ L_k[3] = ∅.
Based on the above-mentioned division, we can decompose the generation of
candidate sets into three parts, i.e., generating C_k[1], C_k[2], and C_k[3], respectively.

For C_k[1] and C_k[2], this is simple; we just apply the Apriori_gen function in
algorithm Apriori:

C_k[1] = Apriori_gen(L_{k-1}[1]) − L_k
C_k[2] = Apriori_gen(L_{k-1}[2])

If we store the result of C_k, then C_k[1] = C_k − L_k. The key problem is how to generate
C_k[3]. From the above discussion, we know that item-sets in C_k[3] are composed of item-sets
in L_i[1] and L_{k-i}[2]. Thus, we can modify the Apriori_gen function into the Posteriori_gen
function. The Posteriori_gen function takes as arguments L_i[1] and L_{k-i}[2], the sets of all
class-1 frequent i-item-sets and class-2 frequent (k-i)-item-sets. It returns a superset of the set
of all frequent class-3 k-item-sets. The function works as follows: First, in the join
step, we join L_i[1] and L_{k-i}[2]:

insert into C_k[3]
select p.item_1, p.item_2, ..., p.item_i, q.item_1, q.item_2, ..., q.item_{k-i}
from L_i[1] p, L_{k-i}[2] q

Next, in the prune step, we delete all item-sets c ∈ C_k[3] such that some (k-1)-subset
of c is not in L_{k-1}':

for all item-sets c ∈ C_k[3] do
  for all (k-1)-subsets s of c do
    if (s ∉ L_{k-1}') then
      delete c from C_k[3]
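
The join and prune steps of Posteriori_gen can be sketched in Python as follows. This is an illustration rather than the authors' code: item-sets are frozensets, the three arguments are assumed to hold L_i[1], L_{k-i}[2] and L_{k-1}' respectively, and the parameter names are hypothetical.

```python
from itertools import combinations

def posteriori_gen(class1_i, class2_rest, prev_frequent):
    """Generate class-3 candidate k-item-sets.

    class1_i: class-1 frequent i-item-sets; class2_rest: class-2 frequent
    (k-i)-item-sets; prev_frequent: all frequent (k-1)-item-sets under the
    new threshold, used by the prune step.
    """
    candidates = set()
    for p in class1_i:
        for q in class2_rest:
            c = p | q                            # join step
            if len(c) != len(p) + len(q):
                continue                         # p and q must not share items
            k = len(c)
            # Prune step: every (k-1)-subset of c must be frequent.
            if all(frozenset(s) in prev_frequent
                   for s in combinations(c, k - 1)):
                candidates.add(c)
    return candidates

A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
AB, AC, BC = frozenset("AB"), frozenset("AC"), frozenset("BC")
print(posteriori_gen({A, B}, {C}, {A, B, C}))
print(posteriori_gen({AB}, {C}, {AB, AC}))
```

In the first call A and B are old frequent items and C is a new winner, so AC and BC become class-3 candidates. In the second call ABC is pruned because its subset BC is not frequent.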

Correctness 

As far as the correctness of our algorithm is concerned, we need to show that C_k' ⊇ L_k'.
Since C_k' = C_k[1] ∪ C_k[2] ∪ C_k[3], L_k' = L_k[1] ∪ L_k[2] ∪ L_k[3], and clearly
C_k[1] ⊇ L_k[1], C_k[2] ⊇ L_k[2] (as proven in [2]), we only need to show C_k[3] ⊇ L_k[3]. As
defined earlier, each item-set in C_k[3] is composed of two nonempty subsets x_1 and x_2,
x_1 ∪ x_2 = {X_1, X_2, ..., X_k}, x_1 ∩ x_2 = ∅, where the items of x_1 come from L_1 and the
items of x_2 come from l_1. Because any subset of a frequent item-set must
also be a frequent item-set, if we extend each item-set in x_1, x_2 with all possible items
and then delete all those whose (k-1)-subsets are not in L_{k-1}', we would be left with a
superset of the item-sets in L_k[3].

The join is equivalent to extending x_1, x_2 with each item in the database. Thus, after
the join step, we have C_k[3] ⊇ L_k[3]. By similar reasoning, in the prune step, we
delete from C_k[3] all item-sets whose (k-1)-subsets are not in L_{k-1}', but do not
delete any item-set that could be in L_k[3].

3.3 The Posteriori_B Algorithm 

Based on the above discussions, we can now formally state the Posteriori_B 
algorithm as follows: 

Algorithm Posteriori_B: An efficient algorithm for re-mining of association rules upon
support changes.

Input: (1) L_k: the set of all frequent k-item-sets in D under s, where k=1,...,r
       (2) s': where s'<s
       (3) C_k: the set of all candidate k-item-sets in D under s, where k=1,...,r
Output: L': the set of all frequent item-sets in D under s'.

l_1 = {new frequent 1-item-sets}
L_1' = L_1 ∪ l_1
for (k=2; L_{k-1}' ≠ ∅; k++) do begin
  C_k[1] = C_k − L_k                              /* 1-class k candidates */
  C_k[2] = Apriori_gen(L_{k-1}[2])                /* 2-class k candidates */
  C_k[3] = ∅
  for (j=1; j≤k-1; j++) do
    C_k[3] = C_k[3] ∪ Posteriori_gen(L_j[1], L_{k-j}[2])   /* 3-class k candidates */
  for all transactions t ∈ D do begin
    C_t[1] = subset(C_k[1], t)                    /* candidates contained in t */
    C_t[2] = subset(C_k[2], t)
    C_t[3] = subset(C_k[3], t)
    for all candidates c ∈ C_t[1] ∪ C_t[2] ∪ C_t[3] do
      c.support++
  end
  l_k = {c ∈ C_k[1] | c.support ≥ s'}
  L_k[1] = L_k ∪ l_k
  L_k[2] = {c ∈ C_k[2] | c.support ≥ s'}
  L_k[3] = {c ∈ C_k[3] | c.support ≥ s'}
  L_k' = L_k[1] ∪ L_k[2] ∪ L_k[3]
end
Answer = ∪_k L_k'

4 Concluding Remarks 

In this paper, we presented an efficient data re-mining (DRM) method for discovering 
association rules. In order to assess the efficiency and effectiveness of the Posteriori 
algorithm, we have conducted several experiments and compared its performance with 
that of Apriori. Our experiments were performed on a Pentium III PC. The obtained 
results have shown that Posteriori is much faster than the presently most popular min- 
ing algorithm. Furthermore, Posteriori performs 2~6 times faster than Apriori for a 
moderate size database of 100,000 transactions. 
Abstract. Association rule discovery techniques have gradually been
adapted to parallel systems in order to take advantage of the higher
speed and greater storage capacity that they offer. The transition
to a distributed memory system requires the partitioning of the
database among the processors, a procedure that is generally carried
out indiscriminately. However, for some techniques the nature of the
database partitioning can have a pronounced impact on execution time,
and attention will be focused on one such algorithm, Fast Parallel
Mining (FPM). A new algorithm, Data Allocation Algorithm (DAA), is
presented that uses Principal Component Analysis to improve the data
distribution prior to FPM.

Keywords: Data Mining, Association Rules, Parallel Algorithms, Data 
Partitioning, Candidate Sets 



1 Introduction 

The discovery of association rules is an important example of data mining,
and we consider one particular parallel algorithm, Fast Parallel Mining (FPM)
[1]. The performance of FPM improves as the inter-processor record pattern
variation (data skewness, defined formally in rises as this can reduce the 
number of duplicate database operations that are necessary. This observation 
provides the opportunity for improving the performance of FPM by determining 
how the data should be distributed over the processors prior to its application. 
To achieve this, a method that provides good predictions of attribute patterns 
within the database must be employed. Our work investigates the application 
of Principal Component Analysis (PCA) to guide the allocation of records 
in equal numbers to a set of processors in order to maximise variance between 
itemset supports m at each before applying FPM. 

2 Proposed Method for Record Distribution 

Levels of variation between itemset support counts at any one processor are 
particularly influential on the numbers of candidate itemsets generated ^ and 
provide the motivation for using PCA. Although the purpose of this work was to
investigate the feasibility of record redistribution prior to FPM, such a technique 
should clearly add as little computation as possible to that of FPM. PCA 0 is 
well recognised as a preprocessing tool in data analysis and its computational 
cost is linear with the number of records in the database. It has been used
efficiently to handle very large databases [7] and has been adapted to a distributed
memory machine, for example with PARPACK [7].

Data Allocation Algorithm (DAA) uses PCA to redistribute a given database 
as follows: 



1. Find (or use statistical sampling to estimate) the variance/covariance matrix 
for the attribute column means in the database (referred to below as Z). 

2. Apply PCA to Z, giving a matrix of weights. For example, Table 1 shows the
Principal Components (PCs) for the 10 attributes of a binary database. The
PCs are represented by the weights for each attribute, shown in descending
order in the table columns. Each of these PCs is itself given a weight in the
form of its eigenvalue - the 10 eigenvalues for the PCs in Table 1 are: 0.1751,
0.2007, 0.2707, 0.2182, 0.2374, 0.2480, 0.2508, 0.2464, 0.2407, 0.2291.

Table 1. Column PCs for sample data

      PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10
A    0.896   0.266   0.307  -0.122  -0.076  -0.079  -0.042   0.056  -0.020  -0.011
B   -0.405   0.808   0.348  -0.207  -0.113  -0.061  -0.036   0.027  -0.006  -0.034
C   -0.128  -0.459   0.378  -0.763  -0.130  -0.124  -0.070   0.063  -0.049  -0.068
D   -0.075  -0.148   0.352   0.317  -0.449  -0.080   0.029   0.022  -0.239   0.694
E   -0.047  -0.084   0.319   0.154   0.403  -0.350  -0.228  -0.101   0.687   0.214
F   -0.011  -0.049   0.260   0.023   0.056   0.807   0.058   0.417   0.306   0.066
G    0.003   0.004   0.018  -0.041   0.020  -0.247   0.928   0.193   0.195   0.017
H   -0.017   0.044  -0.267  -0.017   0.020  -0.332  -0.257   0.860  -0.038   0.096
I    0.049   0.056  -0.373  -0.152  -0.693   0.013  -0.082  -0.102   0.578   0.034
J    0.075   0.174  -0.373  -0.458   0.341   0.152   0.045  -0.145  -0.037   0.672

3. Choose the required number of processors, S, where S < #attributes. The 
PCs are arranged in descending order according to the value of their 
eigenvalue and each processor is associated with one of the ’strongest’ S PCs. 
If S=4 then the PCs to be considered from Table 1 (in descending order of 
eigenvalue size) would be 3, 7, 6 and 8. 

4. Each PC will contain a group of attributes that are particularly influential 
in its construction; a number of methods can be used for their identification 
13 and we have found graphical approaches the most useful. In order to work 
with all four PCs it is necessary to reach a common number of prominent 
attributes and the average is taken over all PCs, giving 6 in this case. 

5. Each of the S processors is associated with one of the ’strongest’ PCs, i.e. 
those with the largest eigenvalues. In order to maximise variance the promi- 
nent attributes of each PC are used to write rules that determine which 
records will be located at which processor. For example, the ’strongest’ 6 
attributes for processor 1 are the 3rd, 10th, 9th, 4th, 2nd and 5th. Intra- 
processor attribute variance can be maximised by concentrating records con- 
taining these attributes at one processor while at the same time attempting 
to eliminate the support of one or more others. The balance between at- 
tribute supports to be concentrated and those to be eliminated depends on 
the level of attribute supports and covariances; with high supports and low 
covariance the ability to eliminate records containing groups of attributes 
from a processor would be hindered. The choice of attributes for elimina- 
tion is also an issue - the PCA process tends to be disturbed if an attribute 
with a relatively high weight is chosen but little impact is generally made 
if an attribute with a very low weight is used. Trials have shown that the 
prominent attribute with the lowest weight gives the best results. The iden- 
tified prominent attributes for each processor are used to construct rules for 
record acceptance. Assuming high attribute supports and low covariance in 
the above example it is unlikely that the support of more than one attribute 
could be eliminated from each processor. The rules for processor 1 (using 
the 6 prominent attributes) would be as follows: 

a) add any record with a 1 for attribute 3 and a 0 for attribute 5 

b) add any record with a 1 for attribute 10 and a 0 for attribute 5 

c) add any record with a 1 for attribute 9 and a 0 for attribute 5 

d) add any record with a 1 for attribute 4 and a 0 for attribute 5 

e) add any record with a 1 for attribute 2 and a 0 for attribute 5 

f) add any record with a 0 for attribute 5 

g) add any record 

The rules for the other 3 processors are constructed in an identical manner. 
Sometimes clashes will occur as to which 1-itemset supports are to be elim- 
inated; if this is the case the ’weaker’ PCs are always given priority over the 
’stronger’. 

6. Records are taken one at a time from their centralised database and at- 
tempts are made to match them against the rules at each processor. The 
processor attached to the weakest of the S PCs is considered first (in this 
case processor 4) in order to improve load balancing. If a record satisfies rule 
1 for processor 4 then it is stored at processor 4. If not, the first rule for 
processor 3 is considered and the record is placed at processor 3 if there is a 
match. Otherwise the first rule for processor 2 is considered. Once the first 
rules for each processor have been tried with no match the second rules are 
selected in the same order. The rules are chosen in rotation in this manner 
until there are no records left to distribute. As the last rule for each proces- 
sor is always to ’add any record’ each record will always have a destination. 
When a processor becomes full (i.e. reaches its quota of records) its rules are 
withdrawn from the matching process and only those from processors with 
space still to fill are considered. 

7. During the first pass of FPM records can be selected from the centralised 
database and a destination processor determined for them by applying DAA. 
Once this is complete the support contribution of the record for that pro- 
cessor can be determined before the record is placed in its new location. 
At the end of pass 1 the data will have been redistributed and the support 
counts for the candidate 1-itemsets at each processor accumulated. The FPM 
algorithm can then proceed as before. 
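
Steps 3 to 6 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the eigenvalues and the PC3 column are copied from the text and Table 1, while the function names, the tie-breaking between equal weights, the (required, excluded) rule encoding and the single shared quota are assumptions made for the example. 

```python
# Eigenvalues and the PC3 column are copied from the text and Table 1;
# all function names and the rule encoding below are hypothetical.
EIGENVALUES = [0.1751, 0.2007, 0.2707, 0.2182, 0.2374,
               0.2480, 0.2508, 0.2464, 0.2407, 0.2291]
PC3_WEIGHTS = [0.307, 0.348, 0.378, 0.352, 0.319,
               0.260, 0.018, -0.267, -0.373, -0.373]  # attributes A..J


def strongest_pcs(eigenvalues, s):
    """Step 3: 1-based indices of the s PCs with the largest eigenvalues."""
    order = sorted(range(len(eigenvalues)), key=lambda i: -eigenvalues[i])
    return [i + 1 for i in order[:s]]


def prominent_attributes(weights, k):
    """Step 4: 1-based indices of the k attributes with the largest |weight|.

    Ties (attributes 9 and 10 here) are broken by attribute order, which is
    an assumption of this sketch.
    """
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    return [i + 1 for i in order[:k]]


def build_rules(prominent):
    """Step 5: ordered rules as (required, excluded) attribute pairs.

    The last (lowest-weight) prominent attribute is the one eliminated.
    None means "no condition".
    """
    excluded = prominent[-1]
    rules = [(attr, excluded) for attr in prominent[:-1]]
    rules.append((None, excluded))  # rule f: exclusion only
    rules.append((None, None))      # rule g: add any record
    return rules


def matches(record, rule):
    """True if the binary record satisfies the rule."""
    required, excluded = rule
    if required is not None and record[required - 1] != 1:
        return False
    return excluded is None or record[excluded - 1] == 0


def allocate(records, rules_by_processor, quota):
    """Step 6: rotate through rule levels, weakest PC's processor first.

    rules_by_processor is ordered from the strongest to the weakest PC.
    """
    placed = [[] for _ in rules_by_processor]
    depth = max(len(rules) for rules in rules_by_processor)
    for record in records:
        target = None
        for level in range(depth):
            for proc in reversed(range(len(rules_by_processor))):
                rules = rules_by_processor[proc]
                if len(placed[proc]) >= quota or level >= len(rules):
                    continue  # full processors are withdrawn
                if matches(record, rules[level]):
                    target = proc
                    break
            if target is not None:
                break
        if target is None:
            raise ValueError("all processors are full")
        placed[target].append(record)
    return placed


print(strongest_pcs(EIGENVALUES, 4))          # [3, 7, 6, 8]
print(build_rules([3, 10, 9, 4, 2, 5])[:2])   # [(3, 5), (10, 5)]
```

Passing the six prominent attributes quoted in step 5 to build_rules reproduces rules a) to g) for processor 1; the quota argument stands in for the per-processor record limit mentioned in step 6. 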

3 Data Preparation 

With the aim of predicting the outcome of applying DAA, data was generated 
with two fixed parameters (data sparsity and mean-skewness), both of which 
can be estimated without pre-empting the results of FPM. Data sparsity refers 
to the percentage of 0s in the database - a dense database is a term for one with 
low sparsity. Mean-skewness is calculated by first measuring attribute column 
means over four equal-sized data partitions (as if the data had been divided in 
equal portions over four processors) and then applying the skewness metric in 

El' 

A procedure for generating datasets of fixed sparsity and mean-skewness was 
developed in which each attribute of each record was generated using a series 
of probabilities of being either 0 or 1. These probabilities were adjusted until 
the required levels of sparsity and mean-skewness were achieved. The datasets 
were given the label ’(mean-skewness)_(sparsity)’, e.g. 0.0_80 data refers to data 
with mean-skewness 0.0 and sparsity 80%. DAA was applied to each of these 
datasets. The highest number of candidates generated at any one processor was 
recorded and candidate set numbers were measured according to a selection of 
10 support thresholds for each dataset. 
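
As a rough illustration of the two quantities involved, the sketch below generates a binary dataset from per-attribute probabilities, measures its sparsity, and computes the attribute column means over equal-sized partitions. The function names and the seed are hypothetical, and the skewness metric itself is not reproduced because it is defined in the cited work rather than in this text. 

```python
import random


def generate(n_records, probs, seed=0):
    """Draw each attribute of each record as 1 with the given probability."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(n_records)]


def sparsity(data):
    """Percentage of 0s in the database."""
    cells = sum(len(row) for row in data)
    zeros = sum(row.count(0) for row in data)
    return 100.0 * zeros / cells


def partition_column_means(data, parts=4):
    """Attribute column means over `parts` equal-sized partitions.

    Records left over when len(data) is not a multiple of `parts` are
    ignored; that is a simplification of this sketch.
    """
    size = len(data) // parts
    means = []
    for k in range(parts):
        chunk = data[k * size:(k + 1) * size]
        n_attrs = len(chunk[0])
        means.append([sum(row[j] for row in chunk) / len(chunk)
                      for j in range(n_attrs)])
    return means


data = generate(1000, [0.2] * 10)
print(round(sparsity(data), 1))           # close to 80 for p = 0.2
print(len(partition_column_means(data)))  # 4
```

Adjusting the probabilities and regenerating until both measures reach their targets corresponds to the tuning loop described above. 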

For comparison purposes each dataset was also divided across the processors 
in equal sized chunks (without DAA) so as to create what will be referred to 
as the raw data distributions. FPM was then applied and the candidate sets 
recorded in the same manner. DAA was kept separate from FPM in these ex- 
periments so that the performance of FPM on the two groups of datasets could 
be measured directly. 



4 Results 

The performance studies were carried out on an IBM SP2 with 2 ’high’ nodes 
running the AIX operating system (version 4.1). Each node had 4 CPUs 
(PowerPC 604 processors), a clock speed of 100 MHz, 256 MB of memory and 
a 7133-500 DASD disk (five 4.5 GB drives). The Parallel Virtual Machine (PVM) 
system was used to view nodes as a single parallel virtual machine |S| and FPM 
was written in JAVA and JPVM. 

The number of passes of FPM over the database varies from 2 for datasets 
with 85% sparsity to 10 for those with 25% and it is not easy to compare the 
effect of DAA. The second pass often has the highest computational cost of all 
passes 0 and it is therefore particularly important to focus attention on the 
effect of DAA over the first two passes of FPM. Execution times were taken for 
the first two passes over 10 support counts across 4 processors and the average 
percentage reduction in execution time for each pass is shown in Table 2. The 
average results hide significant variation; for example, improvements of up to 
43.3% are made for the 0.0_80 data and up to 16.9% for the 0.0_50 data. 



Table 2. Average percentage reduction in execution time 

                                              Sparsity 
Mean-skewness    85%   80%   75%   70%   65%   60%   55%   50%   45%   40%   35%   30%   25% 
0.0             14.6  13.5  10.9  11.2   7.4   7.8   4.2   6.2   5.2   3.0   5.1   1.6   1.5 
0.02             5.6   6.2   6.7   2.8   5.2   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
0.04             4.2   6.2   6.0   5.8   2.5   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
0.06             6.2   7.5   8.2   5.9   9.3   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
0.08             8.0   8.6   8.7   5.9   0.7   7.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
1.0              4.2  11.2   8.5   7.4   9.5   4.4   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
2.0              4.0   4.1  12.3  13.8  12.8   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
3.0             11.2  10.1  12.7   8.3   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 



Three main observations can be made: 

1. The higher the sparsity of the dataset the better DAA performs. Lower 
sparsity results in higher covariance levels between attributes which hinders 
the ability of DAA to distribute records. 

2. The performance of DAA deteriorates after a sparsity level of 25% and this 
relates to the rule construction of DAA. Each set of rules aims to maximise 
the occurrence of a set of attributes within a specific data partition and to 
minimise the occurrence of one or more others. No direct attention is given 
to any other attribute, the behaviour being left to the PCA mechanism. An 
attribute that is unregulated at one processor but is maximised at another 
will only have a negative impact on DAA if it has significantly more support 
than is possible to store at one data partition. Average levels of support can 
be estimated by considering the sparsity level of the data and the number of 
processors. For example, with 25% sparsity and 4 processors the average sup- 
port level will be equal to the number of records at any given data partition 
- if data sparsity rises higher than this level then DAA will have increasing 
difficulty in controlling the support count patterns across the processors. 

3. The performance for DAA within each fixed band of sparsity varies con- 
siderably. This can be explained by looking at the means and the vari- 
ance/covariance matrix of the dataset in question. If a dataset has a few 
attributes with relatively high support which have relatively high covari- 
ance levels between each other then this will severely impede the ability of 
DAA to distribute the data efficiently. The execution results for the first line 
of Table 2 are particularly high and each dataset has a very uniform 
variance/covariance matrix - if the relationship between attributes is uniform it 
is far easier for DAA to place records so that itemset generation is balanced 
across the processors. 




Fig. 1. Performance curves for 0.0_80 and 0.0_50 data (x-axis: Number of Processors) 



Execution time was also measured over 2 and 6 processors and Figure 1 shows 
performance curves for datasets 0.0_80, 0.0_50, 0.02_70 and 0.2_70. Performance 
using p processors is defined as 1/Tp where Tp is the time to execute the parallel 
code with p processors. A reasonable measure of good performance is one where 
performance increases in accordance with the number of processors being used, 
i.e. p/T1, and the straight line in Figure 1 represents the (naive) ideal situation. 
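
The performance measure and the ideal line can be written down directly. The sketch below assumes that T1 denotes the execution time on a single processor; the timings used are invented purely for illustration and are not results from the experiments. 

```python
def performance(t_p):
    """Performance with p processors, defined as 1/Tp."""
    return 1.0 / t_p


def ideal_performance(p, t_1):
    """The (naive) ideal line p/T1, assuming T1 is the one-processor time."""
    return p / t_1


def fraction_of_ideal(p, t_p, t_1):
    """How close a measured run is to the ideal line (1.0 = on the line)."""
    return performance(t_p) / ideal_performance(p, t_1)


# Hypothetical timings in seconds, keyed by number of processors.
t_1 = 200.0
timings = {2: 110.0, 4: 60.0, 6: 50.0}
for p, t_p in sorted(timings.items()):
    print(p, performance(t_p), ideal_performance(p, t_1))
```

A curve that stays near the ideal line has a fraction_of_ideal close to 1 at every processor count. 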

DAA/FPM outperforms FPM for the 0.0_50 and the 0.2_70 data and scales 
well as the number of processors rises. The time saved by scanning database 
partitions in parallel has more impact as the datasets become denser and more 
heavily populated with 1s. With the 0.0_80 data no impact is made by raising 
the number of processors from 4 to 6 whereas the performance of the 0.0_50 data 
scales well for all processor levels considered. 

The benefits of faster data scanning need to be weighed up against the extra 
cost in message passing brought about by the addition of further processors. 
Messages for FPM are O(n^2), where n is the number of processors. If very little 
improvement is gained in the performance of data scanning by partitioning the 
data over larger numbers of processors the increased numbers of messages will 
have a considerable impact. 

The 0.0_80 performance curve lies closer to the ideal line than that of the 
0.0_50 data. DAA performs more efficiently on data with high sparsity as the 
covariance between attributes is lower. The rule structure of DAA also means 
that its performance worsens as the density of the data rises, as explained above. 
Although DAA is more effective when applied to the 0.0_80 data than the 0.0_50 
in terms of increasing the level of performance over FPM, the impact of raising 
the number of processors has greater significance for the 0.0_50 data due to the 
reduction in database scanning overheads. These results suggest that the 0.0_50 
data will continue to scale well as processor numbers rise but that the 0.0_80 
data will gradually move towards the ’old’ performance line. 

Datasets 0.02_70 and 0.2_70 both have 70% sparsity and both performance 
curves have a positive gradient. However, when DAA/FPM is applied to the 
former dataset there is no significant change in performance when compared to 
the application of FPM. The explanations for these characteristics lie in the load 
balancing of both the data and the candidate set numbers. 

5 Conclusion 

DAA has been shown to make significant improvements to the performance 
of FPM, particularly for data of high sparsity. For dense data the vari- 
ance/covariance matrix should be checked before applying DAA as DAA is ad- 
versely affected by attributes with high means and covariances. Further work is 
required to test DAA on larger datasets and greater numbers of processors. Its 
impact on other parallel association algorithms would also be of interest. The 
execution overheads of applying DAA and the effectiveness of sampling for PCA 
also need to be carefully measured and rigorously investigated. 

Abstract. Inclusion of domain knowledge in a process of knowledge discovery 
in databases is a complex but very important part of successful knowledge 
discovery solutions. In real-life data mining development, non-structured 
domain knowledge involvement in the data preparation phase and in the final 
interpretation/evaluation phase tends to dominate. This paper presents an 
experiment of direct domain knowledge integration in the algorithm that will 
search for interesting patterns in the data. In the context of stock market 
prediction work, a recent rule induction algorithm, PA3, was adapted to include 
domain theories directly in the internal rule development. Tests performed over 
several Portuguese stocks show a significant increase in prediction performance 
over the same process using the standard version of PA3. We believe that a 
similar methodology can be applied to other symbolic induction algorithms and 
in other working domains to improve the efficiency of prediction (or 
classification) in knowledge-intensive data mining tasks. 



1 Introduction 

In most cases, the availability and the efficient use of Domain Knowledge (DK) 
during the development process of a Knowledge Discovery in Databases (KDD) 
system is essential for successful knowledge discovery. In fact, DK is needed for 
almost any practical knowledge discovery task, independently of the domain or of the 
data mining techniques used, since, at least, some form of DK must be involved in the 
problem definition, in the data preparation and in the results evaluation and utilization 
phases. Sometimes, however, the involvement of DK in the process does not result in 
all the advantages it could bring. In fact, in some real-life situations where KDD 
could be useful, the available formally specified DK is restricted to description or 
definition of data and other forms of DK (for example theories about the way domain 
variables interact) exist only in informal, sometimes uncertain, non-structured forms. 
This kind of limitation of previously existing DK, together with the scarcity of 
theoretical work on the topic, usually results in no deliberate involvement of existing 
DK in the specific data mining phase of many real-life KDD processes. 

DK involvement in the data mining step of a KDD process always implies a 
conditioning of the search of hypotheses conducted by the data mining algorithm. 
This conditioning can operate through an “initialization bias” (introducing starting 
conditions for the search), or through a “search bias” (distorting the search space, or 
the evaluation of hypotheses) [14], [15]. 

DK can be included in the data mining phase through direct integration (implicit or 
explicit) in the data mining algorithm, or through an associated knowledge base. In 
the first case, specific changes to the core data mining algorithm must be performed, 
in order to directly represent the involved domain knowledge through a biasing of the 
search. In the latter case, a very tight coupling between the domain theory description 
in the knowledge base and the bias representation language accepted by the learner is 
needed, possibly involving an intermediate knowledge “translator” [3]. In any case, both 
of these forms of DK integration tend to need software specifically adapted for each 
application case, since different kinds of domain knowledge usually involve different 
representations, and most data mining algorithms (and commercial data mining 
programs) do not allow the integration of any form of DK not contained in the data. 

Direct integration of DK in data mining software generally intends to direct and 
focus the pattern search that takes place at that KDD step. This can raise another 
potential limitation of this technique: If badly directed, the focused search can miss 
some of the potentially interesting patterns that an unbiased search could find in the 
data [4]. However, in spite of the limitations and potential problems, we believe that, 
in some cases, careful DK integration in the data mining step of a KDD process can 
produce significant improvements in the overall efficiency of the process. 

This paper presents an experiment that integrates two domain theories directly in a 
rule induction data mining algorithm. The domain is short-term stock market 
prediction, and the two theories bias the algorithm, during rule search, against a 
specific class of rules, and towards another. The theories are tested over five data sets 
that correspond to multivariate information based on daily quotes of five of the most 
significant stocks in the Portuguese BVLP stock exchange. The base rule induction 
algorithm used, PA3 [1], is a recent general-purpose sequential cover algorithm that 
combines general-to-specific and specific-to-general search to develop each rule. 



2 Domain Knowledge 

Adopting a restrictive DK definition, we will be interested only in domain theories 
that explain or predict future behavior of stocks on the basis of known data. This kind 
of domain theory is extremely uncertain in stock market prediction. There are, 
basically, three different positions: Those who believe that the markets are highly 
efficient and, as a result, essentially unpredictable, those who advocate “fundamental 
analysis” of the business results of the quoted companies, and those who believe that 
“technical analysis” (the analysis of historical stock quotes data, isolated from other 
known facts) is enough to predict the future behavior of those stocks [6]. 

The “efficient market” hypothesis, at least in its weakest form, has been 
traditionally accepted in some academic circles as basically correct, and if that were 
really the case, any effort to predict future behavior of listed stocks would be futile. 
However, besides the firm belief of those who really invest in stock markets (most of 
the investors and all the speculators), there is a growing body of published research 
indicating that at least some markets exhibit imperfections (which translate to a 
degree of predictivity) [7], [16], [11]. 

Classic “fundamental analysis” has a solid background theory but, even when 
successful in the long term, is not very useful for predicting short-term movements of 




Direct Domain Knowledge Inclusion in the PA3 Rule Induction Algorithm 423 



stock values [7]. A marginal aspect related to fundamental analysis that can be linked 
to very important fast movements of stock prices is the announcement of surprising 
fundamental company information (or surprising macroeconomic information, 
relevant for the whole market). However, this kind of fast readjustment of 
fundamental expectations will not be explicitly integrated in the analysis conducted in 
this paper, since it does not seem relevant for the paper’s objectives and it requires 
very complex base data, and very demanding data preparation. 

The theory behind present “technical analysis” is abundant. Unfortunately it is also 
fragmented and often of dubious quality, most of it corresponding to unproved, 
sometimes untested, hypotheses. Moreover, the fact that technical analysis theory is 
still not seriously established can hide a fundamental problem: Even if technical 
analysis is realistically possible, perhaps it cannot be generalized for different 
markets, or for different stocks and different time frames of a market. 



3 The Problem and the Data 

The work we are involved in aims to predict the future behavior of five stocks listed 
in the Portuguese BVL stock exchange, utilizing historical data and DK. 

This paper describes work done on direct domain theory integration in a rule 
induction algorithm used for the prediction of the next day behavior of each stock 
(binary prediction of rise or fall). This kind of next-day prediction is not enough to 
develop an operational trading strategy, but it is frequently found in the literature [2], 
[9], [11], and seems adequate to test the validity of the two domain theories involved. 

For this very short-term prediction task, we simplified the base data by omitting 
fundamental information (and by not accounting for dividend payments), and used 
only historical stock quotes, transaction volumes and index values. It should be 
noticed that this base data has low information content for the prediction task, and 
could never result in very high accuracy rates, even with ideal data preparation and 
data mining steps. This situation is similar to having very noisy data both for learning 
and testing, and tends to present overfitting problems during the data mining process. 
With this problem in mind, we selected the domain theories to integrate in the rule- 
induction software aiming to reduce overfitting of the training data. 

The five companies chosen for prediction are among those most actively traded on 
the BVL stock exchange: BCP, Brisa, Cimpor, EDP and PT. For each of the four 
companies other than Brisa, daily data from 3-Nov-1997 to 29-Oct-1999 were 
available. For Brisa, quotation on the BVL only started on 25-Nov-1997, so the 
available data start on 25-Nov-1997 and also end on 29-Oct-1999. Each of the 
resulting 495 records (479 for Brisa) includes the day’s date, the closing value of the 
stock exchange main index (BVL30), the number of shares traded, and the opening, 
maximum, minimum and closing values of the stock. 

From each company’s base data we constructed 15 daily-based “technical 
indicators” to be used as features to mine. These features are functions of the base 
data variables and summarize relations extracted from the previous 10 days of base 
data. As an example, one of the features expresses the relation between the 10-day 
and 3-day weighted moving averages of daily “reference values” (average of 
maximum, minimum and closing prices). Some of these features are categorical, 
while the others have integer or real values. However, the data mining algorithm 
requires discrete values, so we converted the original values of the features to discrete 
integer values ranging from 1 to 5, with the categorical features resulting in unordered 
sets of these values and the numerical features resulting in ordered sets. As an 
example, the described relation between the 10-day and 3-day moving averages 
results in an ordered-value feature that is discretized in the following way: 

If (0.96 > (MA(10-day)/MA(3-day))) then the feature value is 1; 

If (0.99 > (MA(10-day)/MA(3-day)) ≥ 0.96) then the feature value is 2; 

If (1.01 > (MA(10-day)/MA(3-day)) ≥ 0.99) then the feature value is 3; 

If (1.04 > (MA(10-day)/MA(3-day)) ≥ 1.01) then the feature value is 4; 

If ((MA(10-day)/MA(3-day)) ≥ 1.04) then the feature value is 5. 
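
The five cases above amount to a lookup against four cut points. A minimal sketch follows; it assumes that each interval includes its lower bound, so that every ratio falls in exactly one bin, and the function name is hypothetical. The weighted moving averages themselves are taken as given, since their weights are not specified here. 

```python
from bisect import bisect_right

# Cut points for the MA(10-day)/MA(3-day) ratio, taken from the text.
CUT_POINTS = [0.96, 0.99, 1.01, 1.04]


def discretize_ma_ratio(ratio):
    """Map the moving-average ratio to an ordered feature value in 1..5.

    Lower bounds are treated as inclusive (an assumption of this sketch),
    so a ratio of exactly 0.96 maps to 2 and exactly 1.04 maps to 5.
    """
    return bisect_right(CUT_POINTS, ratio) + 1


for ratio in (0.95, 0.96, 1.00, 1.02, 1.04):
    print(ratio, discretize_ma_ratio(ratio))
```

Other ordered-value features could be discretized in the same way with their own cut points. 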

The developed features were then subjected to a selection process to reduce their 
number to 10. This limitation on the number of features is introduced to help to 
reduce overfitting problems due to the scarce number of examples available in 
relation to the “descriptive power” of the full set of features. To select the 10 features 
to retain we applied (over the learning examples) a combination of methods including 
(with a heavier weight) Hong’s feature selection method [8] and also (with reduced 
weights) a measure of correlation between the feature value and the result to predict, 
and the simple information gain of the feature. 

The final format of each prepared example consists of 10 decision features with 5 
discrete values (classified as ordered or unordered) and one binary result attribute. 
The result attribute indicates, for each example, if the described “reference value” of 
the stock rises or falls on the next trading day. The total number of examples 
available for each stock is 478 (462 for Brisa). This number is smaller than the 
number of days in the original data mainly because several of the first days must be 
used to construct some of the features of the first example. 



4 The PA3 Rule Induction Algorithm 

The rule induction algorithm we used, called PA3, is a recent general-purpose 
sequential cover algorithm [1]. The main features of PA3 include: 

• A rule evaluation function that integrates explicit evaluations for rule accuracy, 
coverage and simplicity 

• A rule generalization step that is run immediately after each rule is developed in 
an initial general-to-specific development phase 

• A last rule filtering step that allows a choice of the tradeoff level between the 
accuracy and the global coverage of the final rule list. 

The rule evaluation function is 

V = X + xs , 

where v is the rule value, a is the rule accuracy over the learning examples, c is the 
rule coverage, s is the rule simplicity and P and x are constants that must be chosen 
according to the learning data characteristics (P regulates the relative importance of 
rule coverage and rule accuracy and x regulates the importance of rule simplicity). 







This evaluation function is used to direct the search and to choose among 
alternative rules during the initial general-to-specific rule development and also, in 
the following rule generalization step, to evaluate and choose possible generalizations 
of the rules that result from the initial general-to-specific development. In this 
generalization step the evaluation function of the standard PA3 is used with the same 
parameter values used in the general-to-specific rule development. This way, the 
algorithm only replaces a rule previously found by a more general version of that 
same rule if the latter is better according to the same evaluation measure. 

PA3 induces an ordered list of “if...then...” rules. Each rule has the form “if 
<complex> then predict <class>”, where <complex> is a conjunction of feature tests, 
the “selectors”. In PA3, each selector implies testing a feature to see if its value is 
included in a specified range of values. So, each selector indicates the feature to be 
tested and the (inclusive) upper and lower limits of the range of values it has to be 
tested against. The postcondition of a PA3 rule is a single Boolean value that specifies 
the class that rule predicts for the cases that comply with all the selectors. It should be 
noted that, while a single PA3 rule includes a simple conjunction of tests, the final 
rule set is equivalent to a DNF formula. 
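
The rule format just described maps naturally onto a small data structure. The sketch below is not PA3 itself: the class and function names are hypothetical, and it assumes that the first rule in the ordered list that covers an example makes the prediction, with None returned when no rule applies. 

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Selector:
    feature: int  # index of the feature being tested
    low: int      # inclusive lower limit of the accepted range
    high: int     # inclusive upper limit of the accepted range

    def accepts(self, example: Sequence[int]) -> bool:
        return self.low <= example[self.feature] <= self.high


@dataclass(frozen=True)
class Rule:
    selectors: Tuple[Selector, ...]  # conjunction of feature tests
    prediction: bool                 # the class the rule predicts

    def covers(self, example: Sequence[int]) -> bool:
        return all(s.accepts(example) for s in self.selectors)


def predict(rules: Sequence[Rule], example: Sequence[int]) -> Optional[bool]:
    """Return the prediction of the first covering rule, or None."""
    for rule in rules:
        if rule.covers(example):
            return rule.prediction
    return None


rules = [
    Rule((Selector(0, 4, 5),), True),
    Rule((Selector(1, 1, 2), Selector(0, 1, 3)), False),
]
print(predict(rules, [5, 1]))  # True
print(predict(rules, [2, 1]))  # False
print(predict(rules, [3, 4]))  # None
```

Because each rule is a conjunction and the list is tried rule by rule, the rules predicting one class together behave like the disjunction of conjunctions mentioned above. 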

PA3’s last step uses a simple rule evaluation metric (different from the one used in 
the rule learning process) to filter the complete list of the induced rules, retaining only 
a reduced number of stronger rules. Since the rules learned by this algorithm form an 
ordered list, this rule filtering has to retain a set of the first contiguous rules (also 
maintaining the order of those rules). This filtering process is controlled by a user- 
defined parameter that must be set between 0 (to accept all the discovered rules) and 
close to 1 (to accept only the first, stronger, rules). Globally, this rule filtering method 
allows the user to choose the tradeoff level between a more complete case-space 
coverage and a reduced coverage using only the stronger rules (and therefore with 
greater accuracy). 



5 Domain Knowledge Inclusion in PA3 

Our global KDD process allows the integration and testing of domain theories of the 
“technical analysis” kind through a very simple process: They can be represented by 
the features generated from the original data. With this in mind, the theories that seem 
more useful when integrated at the rule induction algorithm level are “meta-theories” 
that can be globally applicable to the rules (in fact, combinations of “technical 
indicators”) created by the rule induction algorithm from the data features. Since, in 
our domain, the relevant information present in the base data is almost completely 
“drowned” in noise, and overfitting tends to occur, we felt that the “meta-theories” to 
test should preferably be chosen to reduce overfitting. 

One of the two theories we decided to test biases the learner against the selection 
of rules belonging to a particular class, while the other intends to promote rule 
generalization for another class of “marginal” rules. More specifically, the first theory 
states that a good rule should not include a test over an ordered-value feature that only 
accepts its middle value (3, since the range of possible values is 1 to 5), since that 
kind of “neutral” value for an ordered-value feature probably does not point strongly 
to clear changes in the stock value. To integrate this theory in the PA3 rule induction 




426 P. de Almeida 



algorithm, we altered the evaluation of the basic (still unexpanded) rules: When, 
during the rule induction procedure, a rule has a selector involving an ordered-value 
feature with a value of 3, the standard evaluation result for that rule is multiplied by a 
constant (named mod1) with a positive real value smaller than 1, thus reducing the 
rule evaluation result. The second theory states that if a rule includes a test over a 
feature that has ordered values, and a value of 2 or 4 is accepted for that feature, then 
the corresponding “extreme value” (1 or 5 respectively) should also be accepted. The 
reasoning is that if a “strong” (high or low) value for a technical indicator seems to be 
predictive for the future behavior of a stock, then an even stronger (in the same 
direction) value for that indicator should, most of the time, also point to the same 
prediction. To integrate this theory in the PA3 algorithm, we altered the evaluation of 
the rule expansions: When, during the expansion procedure, a rule has a selector 
(involving an ordered-value feature) that is expanded from a value of 2 or 4 to include 
(respectively) the extreme values of 1 or 5, the standard evaluation result is increased 
through multiplication by a constant (named mod2) with a real value greater than 1. 
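
The two adjustments can be sketched as wrappers around whatever value the standard evaluation function returns. The code is a hedged illustration only: the multipliers are written mod1 and mod2, their numeric values are invented, the function names and the (low, high, ordered) selector encoding are hypothetical, and the penalty is assumed to be applied once per rule however many neutral selectors it contains. 

```python
MOD1 = 0.9  # hypothetical value; the text only requires 0 < mod1 < 1
MOD2 = 1.1  # hypothetical value; the text only requires mod2 > 1


def biased_rule_value(value, selectors, mod1=MOD1):
    """First theory: penalise rules that test an ordered feature for 3 only.

    selectors is an iterable of (low, high, ordered) triples.
    """
    if any(ordered and low == high == 3 for low, high, ordered in selectors):
        return value * mod1
    return value


def biased_expansion_value(value, old_range, new_range, ordered, mod2=MOD2):
    """Second theory: reward expansions from 2 down to 1 or from 4 up to 5."""
    old_low, old_high = old_range
    new_low, new_high = new_range
    to_low_extreme = old_low == 2 and new_low == 1
    to_high_extreme = old_high == 4 and new_high == 5
    if ordered and (to_low_extreme or to_high_extreme):
        return value * mod2
    return value


print(biased_rule_value(0.5, [(3, 3, True)]))
print(biased_expansion_value(0.5, (2, 2), (1, 2), ordered=True))
```

Setting both multipliers to 1 recovers the unbiased evaluation, which gives a simple way to compare the biased and standard versions on the same data. 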

The general idea behind this use of uncertain DK at the rule induction level is that
if the theories are globally true, then the rules that do not agree with them have a
greater probability of corresponding to statistical fluctuations found in the learning data,
and not to stable patterns useful for out-of-sample prediction. This problem
originates from the noisy data, the small learning set sizes, and the very large domain
space searched. Introducing a small handicap in the evaluations of key rule classes
ensures that the rules belonging to these classes that are present in the final rule list
must correspond to patterns in the learning data with above-average “strength”. Of
course, if the theories are globally true, they should increase the out-of-sample
accuracy of the predictions. If they are globally wrong, the out-of-sample predictions
should present a reduced accuracy.

It is clear that increasing the number of learning examples reduces the advantages 
of integrating this kind of DK to focus the search, since, with a greater number of 
learning examples, the real patterns in the training data tend to be less obscured by 
noise. A marginal point to notice is that this biasing of the search will, of course, 
always reduce accuracy over the learning data. 



6 Tests 

Testing the integration of the domain theories over the available examples is not 
straightforward, because some characteristics of the domain and of the data limit the 
direct use of normal bootstrap or resampling methods. 

In fact, the time series we intend to predict are far from deterministic, and their
behavior can be expected to change over time due to changes in the underlying
domain mechanics. This way, a prediction model that proves accurate during a certain
time span can be expected to (progressively or suddenly) lose prediction accuracy in
the future. This means that maintaining the temporal order of the examples is
important if each test example prediction is expected to represent the real prediction
setup at the time of that example. (As an example, consider the use of training
examples immediately after the test example being predicted: that corresponds
to the use of context information that would not be available if the prediction of that
example were required in a realistic situation, and can be expected to adjust the
prediction model to the near-future domain behavior, artificially increasing the
prediction accuracy.)

This way, since the examples are “time stamped” and the domain behavior is 
expected to vary over time, we opted for the standard time-sequenced division of the 
examples, instead of a classic bootstrap or resampling method. To ensure unbiased 
test results, the available examples were divided into separate learning/validation and 
test sets. We used the first 300 examples (284 for Brisa) from each stock for learning 
and parameter selection and kept apart the last 178 examples from each set for testing. 

To determine the best values for mod1 and mod2, the first 200 of the 300 examples
(184 of 284 for Brisa) were used for learning with different values for mod1 and
mod2, and the resulting rule sets were tested on the remaining 100 examples from the
learning sets. The test results were averaged over the five stocks, and the best global
values for mod1 and mod2 were selected.

Those values were then used to develop rule lists from the complete sets of 300 
learning examples (284 for Brisa), and the prediction accuracy of those rule lists (over 
the test sets of 178 examples) was compared with the one achieved by rule lists 
obtained using the standard, unbiased, PA3 (mod1 = mod2 = 1).

PA3 uses 3 internal parameters:

• P and X, which regulate rule evaluation during the general-to-specific and
specific-to-general rule development phases, and must be set considering the
domain characteristics.

• The final rule filtering parameter, which must be chosen according to the user's desired
tradeoff level between prediction precision and model coverage.

Since our aim with these tests is not to achieve the best possible prediction results,
but to compare the results with and without the integration of the domain theories, we
chose to simplify our test procedure by setting, from the start, the P and X parameters to
the “standard” values of 0.8 and 0.01 [1], instead of optimizing them through tests
over the training/validation data. Also to simplify the test procedure, the final rule
filtering parameter was set to prevent any rule filtering, and a default rule was added
to the end of each learned (ordered) rule set. This way, every learned model is
guaranteed to produce a prediction for every possible test case.

During the initial test phase, to determine the best values for the two theory
parameters, 7 values were tried for the mod1 parameter (0.4, 0.5, 0.6, ..., 0.9, 1.0) and
11 values were tried for mod2 (1.0, 1.1, 1.2, ..., 1.9, 2.0). The results for each mod1
value were obtained as an average over the mod2 values and vice versa. In all, 11 runs
of the induction algorithm (over each of the 5 example sets) are averaged to obtain
each of the accuracy values for mod1, and 7 runs (also over each of the 5 example
sets) are averaged for each of the accuracy values for mod2.
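
The averaging scheme described above can be sketched as a small grid sweep. This is a minimal illustration, assuming a hypothetical callable `evaluate(stock, mod1, mod2)` that learns a rule list on the first part of a learning set and returns its accuracy on the remaining validation examples. The function and variable names are not taken from the original work.

```python
from itertools import product
from statistics import mean

MOD1_VALUES = [round(0.4 + 0.1 * i, 1) for i in range(7)]   # 0.4 ... 1.0
MOD2_VALUES = [round(1.0 + 0.1 * i, 1) for i in range(11)]  # 1.0 ... 2.0


def marginal_accuracies(evaluate, stocks):
    """Average accuracy per mod1 value (over mod2 and stocks) and vice versa.

    `evaluate` is a hypothetical stand-in for one learn-and-validate run.
    """
    # One averaged accuracy per (mod1, mod2) pair, taken over all stocks.
    acc = {
        (m1, m2): mean(evaluate(s, m1, m2) for s in stocks)
        for m1, m2 in product(MOD1_VALUES, MOD2_VALUES)
    }
    # Each mod1 value is averaged over the 11 mod2 values, and vice versa.
    by_mod1 = {m1: mean(acc[m1, m2] for m2 in MOD2_VALUES) for m1 in MOD1_VALUES}
    by_mod2 = {m2: mean(acc[m1, m2] for m1 in MOD1_VALUES) for m2 in MOD2_VALUES}
    return by_mod1, by_mod2
```

The best global value of each modifier is then the key with the highest averaged accuracy, for example `max(by_mod1, key=by_mod1.get)`.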

This test procedure does not try to optimize the mod1 and mod2 parameters for
each stock. Naturally, each of the tested theories can present a different behavior over
each stock involved in the study, and better final accuracy values could be expected if
individual mod1 and mod2 values were used for each stock. However, considering the
small number of examples available for each stock, we opted to use accuracy values
averaged over the five sets, in order to obtain more robust values for mod1 and mod2.
This way, the chosen values are those that resulted in the global best results across the
5 stocks.

The average accuracy (as tested over the last 100 learning examples) of the rule
sets learned over the first 200 (184 for Brisa) examples of each stock set is shown as a
percentage in Figure 1 for the tested values of mod1 and mod2.




[Figure 1: two charts, one plotting accuracy against the tested mod1 values and one against the tested mod2 values (1.0 to 2.0).]

Fig. 1. Accuracy (in %) over the last 100 learning examples (averaged over the 5 stock sets)

As can be seen from the charts in Figure 1, for some of the tested modifier values
both theories produce an improvement over the standard PA3 (mod1 = mod2 = 1).
However, the lack of regularity of the second theory chart contrasts with the “well
behaved” first theory chart. In fact, in this first test, the second theory achieves an
improvement for some of the mod2 values, but several of the mod2 values tested
produce worse results than the basis value of 1.0. However, since these very simple
first tests were based on relatively few examples and only intended to assist in
choosing the values for mod1 and mod2 to be used in more extensive comparative
tests, neither the stable behavior of the first theory nor the much less stable results for
the second theory can be seen as very relevant.

Among the values tried for mod1 and mod2, the best results were obtained for
mod1 = 0.7 and for mod2 = 1.7. Those best values for mod1 and mod2 were then
tested with rule sets developed over the sets of 300 learning examples (284 for Brisa),
and applied over the five sets of 178 testing examples.

In these tests, a more complex procedure was used to try to achieve more stable
results. Due to the method used to choose the “best” values for the two theory
modifiers, and to the non-stationary nature of the time series involved, we wanted to
keep a separation between the training and test sets, based on a strict time frontier: all
the examples before that point are seen as training examples with a known outcome,
and all the examples after that point are regarded as previously unseen test examples.
That would lead to a simple holdout testing method that, due to the reduced number
of available examples and to the small number of individual tests, would not produce
reliable results, and would not allow a meaningful statistical significance analysis.

To try to circumvent this problem we opted for a test methodology that combines 
the simple holdout [10] and a modified bootstrap [5]. This test methodology uses 100 
tests for each of the five stocks. In each of those tests, a model is learned on the basis 
of a bootstrap sample of the training examples (sampling examples from the original 
training set, using replacement, until a number of examples equal to the number in the 
original set is attained) and that model is tested over the complete, original, set of 
previously unseen 178 test examples. This way, each model is learned from 
approximately 63.2% of the training examples [5], and the models present exactly the 
same variability of standard bootstrap models learned over the training examples (in 
fact, they are learned exactly the same way). The tests, however, are always 
performed over the complete set of “out-of-sample” test examples (the best set of test 
examples we have), assuring that (unlike the standard bootstrap [10]) no optimistic 
“contamination” of results is possible. The bootstrap extraction of learning sets of 
examples is used only to generate variability, and results in a reduced prediction 
accuracy (because some of the training examples are left unused in the learning of 
each model) but maintains a fair test setting for the comparative tests of modifier 
values we want to conduct. 
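
The combined holdout and bootstrap procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: `learn` and `accuracy` are hypothetical callables standing in for the rule induction algorithm and its evaluation, and the seed parameter is added only to make the sketch reproducible.

```python
import random


def bootstrap_holdout(train, test, learn, accuracy, runs=100, seed=0):
    """Learn `runs` models on bootstrap samples of `train`; score each on all of `test`.

    `learn(sample)` and `accuracy(model, test)` are hypothetical stand-ins.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        # Sample with replacement until the sample is as large as the training set.
        sample = [rng.choice(train) for _ in train]
        # The test set is never resampled: it stays strictly out of sample.
        scores.append(accuracy(learn(sample), test))
    return scores


def expected_unique_fraction(n):
    """Expected fraction of distinct training examples in one bootstrap sample."""
    return 1.0 - (1.0 - 1.0 / n) ** n
```

For a training set of 300 examples, `expected_unique_fraction(300)` is close to the 63.2% figure quoted in the text, which is the limit 1 - 1/e for large sets.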

Table 1 shows the accuracy results obtained over the five data sets. 

Table 1. Percentage accuracy for the neutral and best values of mod1 and mod2

           mod1=1.0   mod1=0.7   mod1=1.0   mod1=0.7
           mod2=1.0   mod2=1.0   mod2=1.7   mod2=1.7
BCP         55.88      55.81      55.99      56.82
Brisa       52.02      52.83      52.46      53.20
Cimpor      52.66      53.09      52.36      53.53
EDP         51.56      52.26      52.03      52.31
PT          57.19      56.95      57.97      58.15
Average     53.86      54.19      54.16      54.80

Comparing the results of Table 1 (accuracy values close to 54%) and those 
indicated in Figure 1 (values close to 56%), a global accuracy decrease is clear. This 
decrease is mainly due to a very different behavior of the BVL stock exchange during 
the period corresponding to the learning examples (high volatility with a strong global 
rise) and during the period used to generate the test examples (a steady drop in the
quote values). In those conditions, being able to achieve, over the test examples, 
global results clearly above the 50% level seems a strong indication that valid 
prediction patterns were in fact extracted from the training examples (both using the 
standard version of PA3 and using the versions with integrated DK). A secondary 
reason for the reduced accuracy is the test methodology that only uses about 63.2% of 
the available training examples to generate the prediction models, but this factor is 
partially offset by the increased number of available training examples in these tests.

The average results in Table 1 show that both theories, used in isolation, produce
small accuracy improvements and that, used together, they result in a clearly greater
improvement.

The results obtained for each stock show that when the integration of one of the 
theories in isolation results in a decreased accuracy (in only 3 out of 10 tests), the 
decrease is very small. The two theories combined always result in improved accuracy
(in 5 out of 5 tests). This behavior seems to indicate that the average results can be 
regarded as relatively stable. 

As previously mentioned, the prediction of this kind of financial time series would be
impossible if the markets involved were theoretically efficient. That does not seem to
be the case for most markets, and specifically for the market we are studying. However,
even when stock markets are not theoretically efficient, that hypothesis does not seem
to be very far from being true, and the predictability of stock quote time series is
always marginal. This way, in our binary prediction setting, a prediction accuracy
close to 50% should be expected and any global percent accuracy improvement based
on better data mining techniques must be marginal. This tends to result in a difficult
setting for the analysis of the statistical significance of any data mining improvements.
This global problem is compounded by the time-based sequential nature of the
problems, by the relatively small number of available examples and by a large
variability effect that can be associated with the noisy data [12].

To conduct a meaningful significance test of the integration of the theories, we used our
bootstrap-based setup to analyze the number of times the altered algorithms achieved
better, equal or worse results than the non-altered version (with mod1 = mod2 = 1.0).
This test methodology allows us to conduct any desired number of tests with
models that exhibit a bootstrap-like variance and are still tested in a strict holdout
setup, allowing the tightening of the confidence intervals [13].

Table 2 shows the test results for the 100 runs for each of the five stocks that also 
produced the accuracy results of Table 1. 

Table 2. Number of better (B), equal (E) and worse (W) results in relation to the basic, unmodified
algorithm

           mod1=0.7, mod2=1.0     mod1=1.0, mod2=1.7     mod1=0.7, mod2=1.7
           B      E      W        B      E      W        B      E      W
BCP        50     6      44       49     5      46       54     8      38
Brisa      52     5      43       47     9      44       56     1      43
Cimpor     53     6      41       43     6      51       58     5      37
EDP        51     5      44       51     4      45       49     5      46
PT         47     4      49       52     3      45       54     4      42
Average    50.6   5.2    44.2     48.4   5.4    46.2     54.2   4.6    41.2

One of the points that can be noticed in the results shown in Table 2 is the 
relatively large number of equal results. This is basically due to the fact that, in some 
runs, the mined data (which, in these tests, includes a number of repeated examples) is
stable enough to generate exactly the same rule sets, in spite of the introduced search
bias. Another interesting point is that in these results, only 2 of the 10 tests that
compare the isolated theories with the unmodified algorithm produce more worse than
better results (the slightly worse result of the isolated first theory in the accuracy 
results of the BCP stock is now inverted). This (average) worse accuracy result (see 
Table 1) is due to a small number of very bad results in some of the 100 accuracy 
tests (results that can be considered outliers). 

The global results for each theory and for the two theories combined are consistent 
with the accuracy results shown in Table 1: In isolation, both theories produce a small 
but clear improvement over the unmodified algorithm, and the first theory produces a 
greater improvement than the second. When used together, the two theories produce a
considerably greater improvement. 

Applying a traditional significance analysis (single-sided paired t tests [13]) to the 
results in Table 2, the same general effects are detected: The average results for the 
first theory prove to be better than those of the unmodified algorithm version with 
95% significance. The average results for the second theory are better than those of 
the unmodified algorithm version, but only with 68% significance. The average 
results for the two theories combined are better than those of the unmodified 
algorithm version with 98% significance. This last result seems to correspond to a 
meaningful prediction improvement in the difficult domain involved. 
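
As an illustration of the kind of test mentioned above, the following sketch computes the statistic of a paired t test from two equal-length lists of per-run results. It is a generic textbook computation, not the authors' analysis: how the runs were paired is not detailed in the text, and the function name and inputs are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(modified, baseline):
    """t statistic and degrees of freedom for the paired differences modified - baseline."""
    if len(modified) != len(baseline):
        raise ValueError("both result lists must have the same length")
    diffs = [m - b for m, b in zip(modified, baseline)]
    n = len(diffs)
    # Mean difference divided by its standard error (sample standard deviation).
    # Raises ZeroDivisionError if all differences are identical.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1
```

For a single-sided test, a large positive t supports the modified algorithm. The significance level is then read from the Student t distribution with n - 1 degrees of freedom, using a table or a statistics library.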

It should be pointed out that the test methodology we used expands the number of 
available examples by using simultaneous data from different stocks instead of a 
longer time frame of the same stock. This is common in stock time series data mining, 
but implies that the examples from each stock are not correlated which, of course, is 
not entirely true. 



7 Conclusions 

The work described in this paper reinforced our belief that direct use of DK in the 
core data mining phase of a KDD process can improve the overall efficiency of some 
knowledge discovery processes. In particular, changing the rule evaluation in order to 
introduce domain specific deformations in what would otherwise be an unbiased 
setting seems a promising way of integrating domain specific knowledge in the data 
mining phase of KDD processes that use rule induction algorithms. 

As further work, we intend to test the present theories over more extensive stock 
market data. We also intend to evaluate, over the same domain, other globally 
applicable theories in the line of those involved in the present tests.



Abstract. Classification is a function that matches a new object with
one of the predefined classes. Document classification is characterized 
by the large number of attributes involved in the objects (documents). 
The traditional method of building a single classifier to do all the 
classification work would incur a high overhead. Hierarchical classifi- 
cation is a more efficient method — instead of a single classifier, we 
use a set of classifiers distributed over a class taxonomy, one for each 
internal node. However, once a misclassification occurs at a high level 
class, it may result in a class that is far apart from the correct one. 
An existing approach to coping with this problem requires terms also 
to be arranged hierarchically. In this paper, instead of overhauling the 
classifier itself, we propose mechanisms to detect misclassification and 
take appropriate actions. We then discuss an alternative that masks 
the misclassification based on a well known software fault tolerance 
technique. Our experiments show our algorithms represent a good 
trade-off between speed and accuracy in most applications. 

Keywords: Hierarchical document classification, naive Bayesian classi- 
fier, error control, class taxonomy, parallel algorithm 



1 Introduction 

Classification is a function that matches a new object with one of the predefined
classes. A special kind of classification, document classification, has recently
caught researchers’ attention. A document classifier categorizes the documents
into the classes based on their content. This problem is characterized
by the large number of attributes involved in the objects (documents). While
a few hundred attributes are considered very many for a traditional classifier,
documents often contain thousands or even tens of thousands of terms. The
traditional method of building a single classifier for all the classification work,
known as flat classification, would incur a high overhead.

Koller and Sahami propose the use of hierarchical classification in this
context. Instead of a single classifier, a set of classifiers distributed over a class
taxonomy is used, one for each node. A document is classified in a top-down
fashion from the root to the leaf. For each current node (i.e. class), the child of
maximum likelihood is selected. Thus, by decomposing a job into smaller jobs like
this, together with some other techniques (e.g. feature selection), the amount of work can
be maintained at a manageable level. This method is called simple hierarchical
classification in this paper. However, once a misclassification occurs at a high
level node, there is little chance to correct it at the low levels. The deeper
the classification goes, the further it drifts away from the correct one. A variation
of simple hierarchical classification, known as TAPER, has also been proposed. To
avoid misclassification, it attempts to search for a globally optimal probability by
assigning probabilities to the edges of the taxonomy graph in a way that
transforms the search into a least-cost path problem.
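
The simple top-down scheme described above can be sketched as follows. This is a minimal illustration: the taxonomy representation (a mapping from each class to its list of children) and the `score` callable, which stands in for the per-node classifier, are assumptions made for the example.

```python
def classify_top_down(children, score, doc, root="root"):
    """Descend from the root, picking the most likely child at every node.

    `children` maps a class to its list of child classes (leaves map to []).
    `score(doc, parent, child)` is a hypothetical stand-in for the likelihood
    estimate produced by the classifier at `parent`.
    """
    node = root
    while children.get(node):
        # Greedy choice: an early mistake is never revisited further down.
        node = max(children[node], key=lambda c: score(doc, node, c))
    return node
```

Because each step is greedy, a wrong choice near the root cannot be repaired at lower levels, which is the weakness the error control schemes in this paper address.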

Weiss and Kulikowski propose a different scheme, one of whose main
goals is to remedy the misclassification problem. They utilize a single classifier
over ‘global’ terms. The classifier is actually a set of a special kind of association
rules whose right sides are class labels, but only a portion of the rules are selected
for the classification. A problem with this scheme is that in addition to class 
hierarchy, a term hierarchy is also required, which does not always exist. Also it 
is not clear if the selection rule can adequately reduce the number of association 
rules to make the job by the lone classifier manageable in the general case. 

In this paper, we attack this misclassification problem from a different angle. 
We adopt the hierarchical classification model due to its efficiency, but instead of
trying to reduce the misclassification rate by overhauling the classifier itself, 
we develop mechanisms to detect the misclassification as early as possible and 
then take appropriate actions. We also discuss an alternative that masks the 
misclassification using a well known software fault tolerance technique. 

The rest of this paper is organized as follows. In Section 2, we present a 
general model for document classification using hierarchical classifiers. In Section 
3, the two error control schemes are introduced. We move to the experimental 
results in Section 4 and finally, we conclude this paper in Section 5. 



2 Document Classification 

Informally, a document is a pattern which consists of a number of terms and
is associated with a class value (topic). Each term can occur multiple times in
a document. The dependencies between the class values and the terms follow
a certain probability distribution.

More specifically, we adopt a naive Bayesian model. Each class $c$ is
associated with a multinomial term-variable $V_c$. $V_c$ can take values $i$, $1 \le i \le u_c$,
with probability $p_{i,c}$, where each $i$ denotes a term and $u_c$ is the total number of
different terms. A document in a class is then modeled as a collection of values
(duplicates allowed) that the associated variable $V_c$ generates successively. Let $d$
be a given document in class $c$, $n_d$ be its length, $z_{i,d}$ be the number of occurrences
of value $i$ in $d$, and $P(d \mid c)$ be the probability that
a randomly chosen document is $d$ given that it is in $c$. Then we have

$$P(d \mid c) = \binom{n_d}{z_{1,d},\, z_{2,d},\, \ldots,\, z_{u_c,d}} \prod_{i=1}^{u_c} p_{i,c}^{\,z_{i,d}}$$

where the value of $p_{i,c}$ is estimated from the training documents of class $c$, using an estimate
that differs from the intuitive value; see the cited reference for a justification.

Let $T$ be the class taxonomy, $c$ be an internal node and $c_i$, where $1 \le i \le q$,
be the $i$th child of $c$. Given a document $d$, the classifier at node $c$ classifies it
into one of $c_1, \ldots, c_q$ by choosing the $c_i$ that maximizes $P(c_i \mid c, d)$, the probability
of $d$ belonging to $c_i$ given that it belongs to $c$:

$$P(c_i \mid c, d) = \frac{P(c_i, d)}{\sum_{j=1}^{q} P(c_j, d)} = \frac{P(c_i)\, P(d \mid c_i)}{\sum_{j=1}^{q} P(c_j)\, P(d \mid c_j)}$$

where $P(c_i, d)$ is the probability that we are given a document $d$ and $d$ belongs
to $c_i$; $P(d \mid c_j)$ is estimated as stated above and $P(c_k)$ can be estimated as the
fraction of the documents that belong to class $c_k$.
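
The child selection at a node can be sketched in log space as follows. This is an illustrative sketch under stated assumptions: the multinomial coefficient is omitted because it is the same for every child and cancels in the posterior, the `floor` probability for terms unseen in a class is a hypothetical choice made for the example, and the function names are not from the paper.

```python
import math
from collections import Counter


def log_likelihood(doc_terms, term_probs, floor=1e-9):
    """log P(d | c) up to the multinomial coefficient, which is equal for all classes."""
    counts = Counter(doc_terms)
    # `floor` (an assumption) avoids log(0) for terms unseen in the class.
    return sum(z * math.log(term_probs.get(t, floor)) for t, z in counts.items())


def child_posteriors(doc_terms, priors, term_probs_by_class):
    """P(c_i | c, d) for each child c_i of the current node c."""
    logs = {
        c: math.log(priors[c]) + log_likelihood(doc_terms, term_probs_by_class[c])
        for c in priors
    }
    top = max(logs.values())
    weights = {c: math.exp(v - top) for c, v in logs.items()}  # avoid underflow
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```

The classifier at the node then picks the child with the largest posterior, for example `max(post, key=post.get)`.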

Since documents can contain a large number of terms, we must perform
feature selection to reduce the cost. In addition, feature selection can separate
non-indicative terms, or noise, from feature terms and increase the accuracy of the
classifier, since too many features may cause overfitting and loss of generality. Features
are context sensitive, meaning that we have different features at
different splits in the taxonomy. Thus feature selection should be carried out at
each split in the taxonomy.

One way to do feature selection is to use the Fisher Index. Let $t_k$ be the $k$th
term, $w(t_k, d)$ be the relative frequency of term $t_k$ in document $d$, and
$aw(t_k, c) = \frac{1}{|c|} \sum_{d \in c} w(t_k, d)$. Thus, the Fisher Index of $t_k$ for class $c$ is:

$$\mathrm{Fisher}(t_k, c) = \frac{\sum_{i=1}^{q} |c_i| \, \bigl(aw(t_k, c_i) - aw(t_k, c)\bigr)^2}{\sum_{i=1}^{q} \sum_{d \in c_i} \bigl(w(t_k, d) - aw(t_k, c_i)\bigr)^2} \qquad (3)$$



The idea is that a smaller value for the denominator implies a closer distance 
along dimension tk among the points within each class, and a larger value for 
the numerator signifies a larger distance between any class and c. Thus a larger 
Fisher Index indicates a larger discriminative power of a term for a class. Let 
L be the list of terms in the descending order of their Fisher Indexes for c. We
pick a prefix F of L, and use F for the classification for c. Since F leaves out
most noise terms, it reduces misclassification. The number of terms in F, known
as the feature length, is chosen by the user.
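
The ranking and prefix selection can be sketched as follows. This is a hedged illustration: it reads the denominator as the within-class sum of squared deviations, following the verbal description above (that reading is an assumption), and the input layout and function names are hypothetical.

```python
def fisher_index(freqs_by_child):
    """Fisher index of one term at a node.

    `freqs_by_child` maps each child class c_i to the list of relative
    frequencies w(t_k, d) of the term in the documents d of that class.
    """
    all_freqs = [w for ws in freqs_by_child.values() for w in ws]
    parent_mean = sum(all_freqs) / len(all_freqs)            # aw(t_k, c)
    between = within = 0.0
    for ws in freqs_by_child.values():
        child_mean = sum(ws) / len(ws)                       # aw(t_k, c_i)
        between += len(ws) * (child_mean - parent_mean) ** 2
        within += sum((w - child_mean) ** 2 for w in ws)     # assumed denominator
    return between / within if within > 0 else float("inf")


def select_features(index_by_term, feature_length):
    """Prefix F of the terms sorted by descending Fisher index."""
    ranked = sorted(index_by_term, key=index_by_term.get, reverse=True)
    return ranked[:feature_length]
```

A term whose frequency is tightly clustered within each child class but differs strongly between classes receives a large index and therefore appears early in the ranked list.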



3 Error Control Schemes 

Since simple hierarchical classification is problematic when a misclassification 
occurs at an early level, our approach is to incorporate error control mechanisms 
into the algorithm. We propose two schemes, namely recovery oriented error 
handling and error masking. The latter is a parallel algorithm and should run 
on a multi-processor machine.

3.1 Recovery Oriented Error Handling



The recovery oriented error handling approach is inspired by the way a transactional
database is recovered upon failure. When a failure occurs in a transactional
database, a previous consistent state is reconstructed, and an appropriate recovery
action is taken based on that state. To bring the idea of database recovery
to document classification, a consistent state here means an ancestor class node
to which a document is classified with high confidence. We call it a High Confidence
Ancestor (HCA). When a document is misclassified into a wrong path
in the class taxonomy, we can restart from the HCA and then select another
path. However, our empirical studies show that rollback and reclassification are very
time consuming. To simulate the effects of recovery, we try to identify the wrong
paths first and avoid them during the classification.

To detect the wrong paths, we associate each document with a value called
closeness indicator (CI) to indicate how close the document is to a given topic.
Once a document is misclassified, the more it descends along the selected path,
the further it would drift away from the distribution represented by the nodes
in the path. When CI drops below a certain threshold, we may conclude that
we are on the wrong path. For example, consider the class taxonomy depicted
in Fig. 1. Assume a document is about ‘folk dance’, but has been misclassified
into ‘Business’. While this may seem not entirely unacceptable, it would be
less acceptable to classify it into either ‘Financial’ or ‘Insurance’. Suppose it is
classified into ‘Financial’ by the classifier at ‘Business’ node. Then it faces the
choices of ‘Investment in stock market’ and ‘Portfolio arrangement of mutual
funds’. Neither of these is remotely related to ‘folk dance’, so CI would fall to a
small value and the path would be rejected.

Clearly, CI should be calculated without referring to the probabilities we
used in the classification. Therefore, instead of the one-step probability, the
probability that d belongs to c given the HCA is used as CI. Let c be the class that
document d has been classified into. Let c' be the HCA of c for d. The CI of d
with respect to c under c' is computed as:

$$CI(d, c \mid c') = P(c \mid d, c') = \frac{P(c, d \mid c')}{P(d \mid c')} = \frac{P(c \mid c')\, P(d \mid c)}{\sum_{i} P(c_i \mid c')\, P(d \mid c_i)} \qquad (4)$$



A simple way to determine the threshold is to use 1/N, where N is the total
number of classes at the same level as c and leaf classes at some level above c,¹
in the subtree rooted at c'.
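
The CI computation and the simple 1/N test can be sketched as follows. This is an illustration only: the two dictionaries are hypothetical inputs holding P(c_i | c') and P(d | c_i) for the N candidate classes in the subtree rooted at the HCA c', and treating a CI equal to the threshold as a pass is an assumption.

```python
def closeness_indicator(target, prior_given_hca, doc_likelihood):
    """CI(d, c | c') = P(c | c') P(d | c) / sum_i P(c_i | c') P(d | c_i)."""
    evidence = sum(prior_given_hca[c] * doc_likelihood[c] for c in prior_given_hca)
    return prior_given_hca[target] * doc_likelihood[target] / evidence


def passes_ci_test(target, prior_given_hca, doc_likelihood):
    """Accept the class when its CI reaches the simple 1/N threshold."""
    threshold = 1.0 / len(prior_given_hca)  # N candidate classes under the HCA
    return closeness_indicator(target, prior_given_hca, doc_likelihood) >= threshold
```

With N candidate classes the CI values sum to 1, so at least one class always reaches the 1/N threshold.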

We maintain a moving window of l levels, where l is a user parameter. The
top and the bottom of the window correspond respectively to the levels of the
current HCA and the class into which the document is being classified. Initially,
the HCA is the root. The window moves downwards one level when the class
at the bottom edge passes the CI test, resulting in a new HCA at one level
lower than it was prior to the move of the window.

¹ The class taxonomy can be an unbalanced tree.

[Figure 1: class taxonomy tree. Level 1: Recreation, Business. Children of Business: Insurance, Financial. Leaf classes: Base Ball, Foot Ball, Folk Dance, Rock Roll, Invest. in S.M., Portf. arr. of M.F.]

Fig. 1. A class taxonomy



Algorithm hc_recovery_oriented(T, d, l)
// T: class taxonomy
// d: document to be classified
// l: difference of the level of HCA and that of the current node

HCA ← root(T)
Loop
    CI_list ← find_CI_list(HCA, l)
    If no_of_element(CI_list) = 1 Then
        result_class ← only element of CI_list
    Else
        result_class ← arg max over c in CI_list of { local_prob(HCA, c) }
    Endif
    If result_class is leaf Then return result_class
    HCA ← child of HCA who is an ancestor of result_class
Until forever



Fig. 2. Pseudo code for recovery oriented error control 



Fig. 2 shows the pseudo code of the recovery oriented scheme. Before we do
the real classification, the CI of nodes l levels ahead are calculated so the list of
classes that pass the CI test is known. The algorithm will select the optimal path
with maximum local_prob(HCA, c) (defined below) among all such classes. If
there is only one class passing the CI test, we jump to that class directly without
further calculations. The functions used are listed below:

find_CI_list(c, l)    Suppose the level of c is i. Return a list of the classes at level i + l that
                      pass the CI test, together with any leaf classes between level i + 1 and
                      level i + l - 1 that pass the CI test.

local_prob(r_0, r_n)  Suppose r_0 is an ancestor of r_n and the path from r_0 to r_n is
                      r_0 → r_1 → ... → r_n. Return p(r_1|r_0) p(r_2|r_1) ... p(r_n|r_{n-1}). In plain
                      words, we multiply the one-step probabilities along the path from r_0 to r_n
                      together to get p(r_n|r_0).
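
The product computed by local_prob can be sketched as follows. This is a minimal illustration, assuming the path is given as a list of class names and `step_prob(parent, child)` is a hypothetical callable returning the one-step probability from the classifier at `parent`.

```python
import math


def local_prob(path, step_prob):
    """Multiply one-step probabilities p(r_k | r_{k-1}) along r_0 -> ... -> r_n."""
    # zip pairs each node with its successor: (r_0, r_1), (r_1, r_2), ...
    return math.prod(step_prob(parent, child) for parent, child in zip(path, path[1:]))
```

For a path consisting of a single node the product is empty and the function returns 1, which matches the probability of staying at r_0.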



3.2 Error Masking 

The error masking scheme is based on the idea behind software fault tolerance.
Instead of detecting an error and then performing recovery, we use multiple programs
employing different designs. Among the outputs generated by these programs,
the one generated by a majority is considered correct. More programs
will generate more reliable results, but consume more resources.

Algorithm hc_error_mask(T, d, l, f, f')
// T: class taxonomy
// d: document to be classified
// l: difference of the level of HCA and that of the current node
// f, f': two feature lengths for use with two O-classifiers

1.  n ← root(T)    // level zero
2.  level ← l
3.  (c1, c2, c3) ← (n, n, n)
4.  While c1 is not leaf do
5.      Start three threads:
6.          (i)   c1 ← O-classifier(c1, T, d, level, f)
7.          (ii)  c2 ← N-classifier(n, T, d, level)
8.          (iii) c3 ← O-classifier(c1, T, d, level, f')
9.      Wait the finish of all threads
10.     If not (c1 = c2 = c3) Then
11.         c1 ← majority of c1, c2 and c3 (take a predefined action, e.g. using c2, if no majority)
12.         level ← level + 1
13.         n ← child of n who is an ancestor of c1
14.     Else
15.         level ← level + l
16.         n ← c1
17.     Endif
18. Endwhile
19. Return c1



Fig. 3. Pseudo code for error masking scheme

We adopt a moderate approach. We run three classification methods in parallel.
The first and third classifications are hierarchical classifications in the traditional
sense. The second classification is performed by dynamically skipping some levels
in the class taxonomy. For example, to classify a document based on the
taxonomy in Fig. 1, we can perform an additional classification by first skipping
level 1 (i.e. {Recreation, Business}). Say the three classifications end up with
class ‘Dancing’. We then classify it at node ‘Recreation’. But this time we skip
the left part of level 2, i.e., {Sports, Dancing}. In the following discussion, we
use the terms ‘N-classifier’ and ‘O-classifier’ respectively to refer to the classifiers
with and without skipping the levels. The third classification employs the
O-classifier again but with a different feature length. A majority voting scheme
is used to decide the overall output.

Skipping some levels has the effect of (partially) globalizing the information 
used for the classification, and can therefore reduce the misclassification 
rate. The more levels skipped, the more likely it is that the misclassification 
rate is reduced. In the extreme, if all but the leaf and root levels are skipped, 
we have a flat classifier. However, skipping a large number of levels defeats one 
of the main motivations for using a hierarchical classifier, i.e., handling the 
complexity involved in document classification. Thus a trade-off must be made. 
How to make such a trade-off is application dependent and is determined by users. 
In general, more levels can be skipped if the taxonomy is tall and narrow than 
if it is shallow and wide. 
Fig. 3 shows the pseudo code for the error masking scheme. At line 11, if 
there is no majority formed by the classifiers, a user-defined action should be 
taken. For our experiments, this action is to use c2, because this is usually the 
most accurate (and slowest) classifier. Lines 15-16 are an optimization. 
If c1, c2 and c3 all match, we are confident that the classification is on the 
right track, so we can make a bold move: instead of advancing one level, our 
algorithm moves l levels ahead. Some functions used are defined below: 

O-classifier(c, T, d, k, f)  To classify the document d using the O-classifier in 
                             the taxonomy T from class c to reach a class in level 
                             k or a leaf class at a level higher than k. The 
                             feature length to use in classification is f. 

N-classifier(c, T, d, k)     To classify the document d using the N-classifier in 
                             the taxonomy T from class c to reach a class in level 
                             k or a leaf class at a level higher than k. 
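
The voting step at line 11 of the pseudo code can be sketched in Python as 
follows. This is only an illustration, not the implementation used in the 
experiments: the function name `mask_errors` and the parameter `fallback_index` 
are hypothetical, and class labels are assumed to be plain hashable values. 

```python
from collections import Counter


def mask_errors(c1, c2, c3, fallback_index=1):
    """Combine three classifier outputs by majority voting.

    Returns (winner, unanimous). fallback_index selects the output used
    when all three differ; 1 means c2, the predefined action mentioned
    in the text for the experiments.
    """
    outputs = [c1, c2, c3]
    label, votes = Counter(outputs).most_common(1)[0]
    if votes == 3:
        return label, True       # all agree: the caller may advance l levels
    if votes == 2:
        return label, False      # the majority masks the dissenting output
    return outputs[fallback_index], False  # no majority: predefined action
```

The boolean in the result corresponds to the test at line 10: only a unanimous 
outcome would trigger the optimization of lines 15-16. 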

4 Performance Evaluation 

In this section, we study the performance of the algorithms. We implemented 
five document classification algorithms in C++. Our algorithms, namely the 
recovery oriented and the error masking schemes, are compared against simple 
hierarchical classification, flat classification and TAPER [4]. We ran the 
experiments on a Sun Enterprise E4500 machine with 12 processors. Response time, 
rather than total CPU time, is measured so that the error masking scheme can take 
advantage of parallelism. 

We are interested in data sets with reasonably large class taxonomies, because 
the advantages of skipping levels can only be fully exploited in such data sets. We 
have chosen US patents as the data set because they are organized in a large 
taxonomy. Three sets of data are collected from the US patent database. For 
convenience, we name them Data_388, Data_TAPER and Data_Four. Data_388 is 
the top-level class numbered 388 (motor control system) in the patent database. 
The class taxonomy is formed by all the 98 subclasses under the class 388. 
In each subclass, we download at most 20 patents, resulting in 901 patents. 
Data_TAPER closely resembles a data set used in [4]. The taxonomy of this data 
set is shown in Fig. 4(a). There are 12 leaf classes, each of which is a top-level 
class in the patent database. There are 500 training patents and 300 validation 
patents picked randomly from each leaf. However, Data_TAPER is only a 
three-level data set, which is insufficient to demonstrate all the features of 
our algorithms, while Data_388 consists of only a small number of patents. In 
Data_Four, we therefore expand Data_TAPER by introducing more classes from the 
US patent database and grow the taxonomy by one level. The resulting class 
taxonomy is shown in Fig. 4(b). 

Figs. 5, 6 and 7 show the accuracy and performance of the different algorithms 
on the three data sets. First of all, we achieve 65-70% accuracy in Data_TAPER, 

^ The US patent data are available at several places on the Internet, e.g. Delphion 
Intellectual Property Network (http://www.delphion.com/). 
Fig. 4. Class taxonomies for some data sets used in the experiments: 
(a) class taxonomy for Data_TAPER; (b) class taxonomy for Data_Four 



which is similar to the result in [4]. Among all the experiments, simple 
hierarchical classification is always the fastest algorithm and therefore it is 
the baseline of our comparison. Generally speaking, using a more complicated 
classification scheme is justified only if it is more accurate than the fastest 
algorithm. 

As simple hierarchical classification classifies the document in a greedy man- 
ner but TAPER searches the whole tree for the maximum overall probability, 
TAPER guarantees at least as good accuracy as simple hierarchical classifica- 
tion, although more time is required for the extra search. From the experiments, 
however, TAPER gives almost the same accuracy as the simple hierarchical 
classification. This result suggests the greedy approach of the simple hierarchi- 
cal algorithm is close to optimal. Exhaustive search does not help to boost the 
accuracy. If we are to increase the accuracy, there must be a different approach 
to classify the documents. This is another reason to skip levels. 

Flat classification gives the best accuracy in most cases. However, from our 
experiments, it is clear that this is also the most time-consuming algorithm 
except in the smallest taxonomy (Data_TAPER). Our algorithms occupy a 
middle ground between speed and accuracy. They consistently beat 
TAPER and the simple hierarchical algorithm in terms of accuracy. The recovery 
oriented scheme even slightly surpasses flat classification in accuracy on 
Data_TAPER. However, it does not run fast there, since the recovery oriented 
scheme is essentially doing both a simple hierarchical and a flat classification 
in a three-level data set. In a bigger taxonomy (Data_388 and Data_Four), the 
recovery oriented scheme is clearly faster than flat classification, and at the 
same time more accurate than TAPER and simple hierarchical classification. 

Like the recovery oriented scheme, the error masking scheme is also faster 
than flat classification and more accurate than TAPER and simple hierarchical 
classification in a large taxonomy. Comparing the two error control schemes, 
we find that each is faster than the other in some cases. Due to parallelism, 
it is easy to understand why the error masking scheme can be faster. However, 
the optimization adopted in our implementation also 

^ There seems to be no theoretical support for the superior accuracy of flat 
classification in the general case. For example, the contrary is claimed in [?]. 
Fig. 6. Experimental result on Data_TAPER 




Fig. 7. Experimental result on Data_Four 
gives the recovery oriented scheme an edge, which explains why the recovery 
oriented scheme is faster in many cases. In the recovery oriented scheme, if only 
one class is found to pass the closeness indicator, we jump to that class 
directly. The one-step probability is not calculated. This is a considerable saving 
in running time. In contrast, the error masking scheme always classifies the 
document with three different classifiers, and the slowest one determines the 
response time. As for accuracy, the error masking scheme, while still ahead of 
TAPER and simple hierarchical classification, is often less accurate than the 
recovery oriented scheme. As the recovery oriented scheme does not require a 
multi-processor machine, we feel that it is preferable to the error masking 
scheme. 

In a serious application, we expect a large class taxonomy. In our experiments, 
the response time difference between flat and simple hierarchical 
classification widens as the size of the taxonomy grows. While the accuracy of 
simple hierarchical classification may not be satisfactory, switching to flat 
classification is too radical and computationally expensive. As our algorithms 
can be faster than flat classification on a taxonomy with as few as four levels 
(Data_Four), they represent a good trade-off between speed and accuracy for most 
applications. 
Classification has been studied extensively in the last decades [?]. However, 
most of the work on classification ignores the hierarchical structure of 
classes. In [1], the authors explore the hierarchical structure of attributes to 
improve the efficiency, but assume only a single level of classes. The work 
reported in [4, 12] proposes hierarchical classification based on the class 
taxonomy in the context of document classification. The work in [?] discusses 
document classification without using hierarchical classification. Bayesian 
networks as a model for data mining have been studied in [?]. Feature selection 
is discussed in [?]. The general method is to define a measure first and then 
search for a subset of features that optimizes this measure. The Fisher Index 
method in [4] also follows this line; it does so, however, in a ‘localized’ 
manner, i.e. one term at a time. Although this local method has the weakness 
of not considering the fact that terms may sometimes be related, it does reduce 
the complexity when the number of features is very large. 

In this paper, we have studied document classification using hierarchical clas- 
sifiers with error control capability. We demonstrate that some well established 
strategies in other areas can also find a way to enhance the performance in our 
context. Two methods are proposed, recovery oriented and error masking. Re- 
covery oriented method ‘detects’ an error and rejects it, while error masking 
method ‘masks’ an outcome under suspicion by adopting a better one. Our ex- 
periments show that both methods consistently reduce the misclassification rate 
against TAPER and simple hierarchical classification. The cost is extra running 
time, but they are faster than flat classification on a large taxonomy. Our algo- 
rithms are suitable for classifying documents into a large taxonomy where the 
users are willing to spend the extra time to trade for a higher accuracy. 





Hierarchical Classification of Documents with Error Control 






References 

1. H. Almuallim, Y. Akiba, S. Kaneda, “An efficient algorithm for finding optimal 
gain-ratio multiple-split tests on hierarchical attributes in decision tree learning”, 
Proc. of National Conf. on Artificial Intelligence, AAAI 1996, pp 703 - 708. 

2. R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer and A. Swami, “An interval classifier 
for database mining applications”, Proc. of VLDB, 1992, pp 560 - 573. 

3. L. Breiman, J. Friedman, R. Olshen and C. Stone, “Classification and regression 
trees”, Wadsworth, Belmont, 1984. 

4. S. Chakrabarti, B. Dom, R. Agrawal and P. Raghavan, “Using taxonomy, discrimi- 
nants, and signatures for navigating in text databases”, Proc. of the 23rd VLDB, 
1997, pp 446 - 455. 

5. K. Cios, W. Pedrycz and R. Swiniarski, “Data mining methods for knowledge 
discovery”, Kluwer Academic Publishers, 1998. 

6. P. Cheeseman, J. Kelly, M. Self, “AutoClass: a Bayesian classification system”, 
Proc. of 5th Int’l Conf. on Machine Learning, Morgan Kaufmann, June 1988. 

7. N. Friedman and M. Goldszmidt, “Building classifiers using Bayesian networks”, 
Proc. of AAAI, 1996, 1277 - 1284. 

8. T. Fukuda, Y. Morimoto and S. Morishita, “Constructing efficient decision trees 
by using optimized numeric association rules”, Proc. of VLDB, 1996, pp 146 - 155. 

9. J. Gehrke, R. Ramakrishnan and V. Ganti, “Rainforest - a framework for fast 
decision tree construction of large datasets”, Proc. of VLDB, 1998, pp 416 -427. 

10. D. Heckerman, “Bayesian networks for data mining”, Data Mining and Knowledge 
Discovery, 1, 1997, pp 79 - 119. 

11. D. Koller and M. Sahami, “Toward optimal feature selection”, Proc. of Int’l. Conf. 
on Machine Learning, Vol. 13, Morgan-Kaufmann, 1996. 

12. D. Koller and M. Sahami, “Hierarchically classifying documents using very few 
words”, Proc. of the 14th Int’l. Conf. on Machine Learning, 1997, pp 170 - 178. 

13. M. Mehta, R. Agrawal and J Rissanen, “SLIQ: a fast scalable classifier for data 
mining”, Proc. of fifth Int’l Conf. on EDBT, March 1996 

14. J. Quinlan, “Induction of decision trees”. Machine Learning, 1986, pp 81 - 106. 

15. J. Quinlan, “C4.5: programs for machine learning”, Morgan Kaufmann, 1993. 

16. G. Salton, “Automatic text processing, the transformation analysis and retrieval 
of information by computer”, Addison - Wesley, 1989. 

17. J. Shafer, R. Agrawal and M. Mehta, “Sprint: a scalable parallel classifier for data 
mining”, Proc. of the 22nd VLDB, 1996, pp 544 - 555. 

18. E. S. Ristad, “A natural law of succession”, Research report CS-TR-495-95, Prin- 
ceton University, July 1995. 

19. S. Weiss and C. Kulikowski, “Computer systems that learn: Classification and 
prediction methods from statistics, neural nets, machine learning and expert 
systems”, Morgan Kaufmann, 1991. 

20. K. Wang, S. Zhou and S. C. Liew, “Building hierarchical classifiers using class 
proximity”, Proc. of the 25th VLDB, 1999, pp 363 - 374. 

21. Y. Morimoto, T. Fukuda, H. Matsuzawa, T. Tokuyama and K. Yoda, “Algorithms 
for mining association rules for binary segmentations of huge categorical databa- 
ses”, Proc. of VLDB, 1998. 




An Efficient Data Compression Approach to the 
Classification Task 



Claudia Diamantini and Maurizio Panti 

Computer Science Institute, University of Ancona, via Brecce Bianche, 60131 

Ancona, Italy 

{Diamanti, Panti}@inform.unian.it 



Abstract. The paper illustrates a data compression approach to classi- 
fication, based on a stochastic gradient algorithm for the minimization of 
the average misclassification risk performed by a Labeled Vector Quan- 
tizer. The main properties of the approach can be summarized in terms 
of both the efficiency of the learning process, and the efficiency and accuracy 
of the classification process. The approach is compared with the 
strictly related nearest neighbor rule, and with two data reduction algo- 
rithms, SVM and IB2, on a set of real data experiments taken from the 
UCI repository. 



1 Introduction 



Data mining can be viewed as a model induction task, that is, the task of building 
a descriptive or predictive model of a phenomenon starting from a set of instances 
of the phenomenon itself. In particular, the scope of this paper is predictive 
model induction. This problem has long been studied in disciplines 
like statistics, pattern recognition and machine learning. In pattern recognition, 
non-parametric classification methods have been developed, such as the nearest 
neighbor classifier [?]. This method is often considered for data mining tasks 
because of its conceptual simplicity combined with good classification 
performance, which often turns out to compete with that of other, more 
sophisticated, approaches [?]. However, it also has a severe limitation for data 
mining, namely the fact that the entire training set has to be processed in order 
to classify a new datum, making its application to huge databases infeasible. 

To reduce the classification cost of the nearest neighbor classifier, data 
reduction techniques were introduced [?]. The aim of data reduction is to select, 
from the whole set of data, the subset which allows the minimum degradation in 
performance, introducing in this way an accuracy vs efficiency tradeoff. It also 
introduces a cost for “learning” (i.e. the running of the reduction algorithm) 
which, very often, turns out to be itself too expensive to be applied to large 
databases, so an important research topic in data mining is the development of 
techniques to improve algorithm scalability [9, 12, 15]. 

A complementary approach to data reduction is data compression [?]. 
In this approach, the aim of learning is to compress the information contained 
in the original data into a new, reduced set of elements. The main advantages 
of compression over reduction techniques are that compression units are free to 
move in the feature space, which in principle allows a more accurate classifier 
design, and that the complexity of the design is independent of the size of the 
training set. Of course, compression, as well as data reduction, works well only 
if the learning criterion is appropriate. As a matter of fact, in data reduction, 
learning algorithms were naturally concerned with the classification problem, 
while data compression is historically related to the problem of data 
reproduction, hence to the minimum mean squared distortion criterion [?]. 
Kohonen [?] was the first to introduce data compression algorithms suited to the 
classification task, but mainly on an intuitive basis. 

In this paper, we want to bring to the attention of researchers the features 
of a data compression approach to classification, based on a stochastic gradient 
algorithm for the minimization of the average misclassification risk performed by 
a Labeled Vector Quantizer (LVQ). The approach has the following advantages: 

- Minimization of the average misclassification risk guarantees that 
the learning guides the LVQ towards (locally) optimal classification 
performance. That is, it guarantees the effectiveness of the learning process; 

- The use of a stochastic gradient algorithm guarantees the efficiency of the 
learning process. In particular, the use of one sample per iteration allows the 
data to be kept on hard disk during the learning, with no accuracy vs efficiency 
tradeoff; 

- The particular quantization architecture adopted yields a nearest 
neighbor classification rule, that is, a very simple rule, which outperforms the 
classical nearest neighbor classifier; 

- LVQ architectures strongly compress the information contained in 
the training set. Thus the classification is based on a very small number of 
elements with respect to the training set size. 

In the following, these advantages are tested on real data sets taken 
from the UCI Machine Learning repository, comparing the method with the 
classical Nearest Neighbor (1-NN) classifier, Support Vector Machines (SVM) 
[?] and the IB2 algorithm of the Instance Based Learning family [?]. 

2 The Bayes Vector Quantizer 

In the statistical approach to classification, data are described by a continuous 
random vector $x \in \mathcal{R}^n$ (feature vector) and classes by a discrete 
random variable $c \in \mathcal{C} = \{c_1, c_2, \ldots, c_C\}$. Each class $c_i$ 
is described in terms of the conditional probability density function (cpdf) 
$p_{x|c}(x|c_i)$ and the a priori probability $P_c(c_i)$. The predictive accuracy 
of a classification rule $\Phi : \mathcal{R}^n \to \mathcal{C}$ is evaluated by the 
average misclassification risk 

$$R(\Phi) = \int R(\Phi(x)|x)\, p_x(x)\, dV_x , \tag{1}$$ 
where $p_x(x) = \sum_{i=1}^{C} P_c(c_i)\, p_{x|c}(x|c_i)$, $dV_x$ denotes the 
differential volume in the $x$ space, and $R(c_j|x)$ is the risk in deciding for 
class $c_j$ when a particular $x$ is observed. $R(c_j|x)$ is defined as 

$$R(c_j|x) = \sum_{i=1}^{C} b(c_i, c_j)\, P_{c|x}(c_i|x) . \tag{2}$$ 

In (2), $b(c_i, c_j) > 0$ expresses the cost of an erroneous classification, i.e. the 
cost of deciding in favor of class $c_j$ when the true class is $c_i$, with 
$b(c_i, c_i) = 0\ \forall i$. If $b(c_i, c_j) = 1\ \forall i \neq j$, the average 
misclassification risk reduces to the simpler error probability. $P_{c|x}(c_i|x)$ 
can be derived from $p_{x|c}(x|c_i)$ by the Bayes theorem. A well known result is 
that the best possible classification of a feature vector consists in mapping it 
to the class with the minimum conditional risk (2) (Bayes rule): 

$$\Phi_B(x) = \arg\min_{c \in \mathcal{C}} \{ R(c|x) \} . \tag{3}$$ 
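
As a small worked illustration of equations (2) and (3), the following Python 
sketch computes the conditional risk of each decision and picks the minimum. 
The posteriors and cost matrices are made-up numbers, and the function names 
are hypothetical; `cost[i][j]` stands for the cost of deciding class j when the 
true class is i. 

```python
def conditional_risk(j, posteriors, cost):
    """Risk of deciding class j: sum over true classes i of cost * posterior."""
    return sum(cost[i][j] * p for i, p in enumerate(posteriors))


def bayes_decision(posteriors, cost):
    """Index of the class with minimum conditional risk (Bayes rule)."""
    return min(range(len(posteriors)),
               key=lambda j: conditional_risk(j, posteriors, cost))


# Hypothetical two-class example with posteriors 0.7 and 0.3.
posteriors = [0.7, 0.3]
zero_one = [[0, 1], [1, 0]]      # every error costs 1: risk is error probability
asymmetric = [[0, 1], [5, 0]]    # deciding class 0 when the truth is class 1 costs 5
```

With the zero-one cost the rule selects the most probable class (index 0); with 
the asymmetric cost the risk of deciding class 0 becomes 5 x 0.3 = 1.5, larger 
than the risk 0.7 of deciding class 1, so the decision flips. 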

The development of non-parametric methods and learning algorithms for 
classification arises from the attempt to overcome the limits of applicability of 
this optimal rule, related to the fact that the cpdfs involved are in general 
unknown. Thus, their ultimate goal is to obtain an estimate of such functions of 
$x$ on the basis of the training set. 

In this paper we take a different approach, based on the observation that a 
classification rule $\Phi : \mathcal{R}^n \to \mathcal{C}$, being a total and 
surjective function, induces a partition of the feature space $\mathcal{R}^n$ into 
$C$ regions $R_1, \ldots, R_C$ (decision regions), where $R_i$ is the set of points 
which are pre-images of class $c_i$ in $\mathcal{R}^n$. Starting from 
this observation, we propose an algorithm to adapt an initial labeled partition, 
where labels represent classes, towards the optimal partition induced by the 
Bayes rule. Notice that, in this way, the function of $x$ we try to approximate is 
directly $\Phi_B(x)$, which, under the hypothesis of piecewise continuity of the 
cpdfs, is a piecewise constant function. We encode a labeled partition by a 
Labeled nearest neighbor Vector Quantizer. 

A nearest neighbor Vector Quantizer (VQ) of size $M$ is a mapping 

$$\Omega : \mathcal{R}^n \to \mathcal{M},$$ 

where $\mathcal{M} = \{m_1, m_2, \ldots, m_M\}$, $m_i \in \mathcal{R}^n$, 
$m_i \neq m_j$, which defines a partition of $\mathcal{R}^n$ into $M$ regions 
$V_1, V_2, \ldots, V_M$, such that 

$$V_i = \{ x \in \mathcal{R}^n : d(x, m_i) < d(x, m_j),\ j \neq i \} .$$ 

Basically, a VQ performs data compression since it represents each point of 
a region $V_i$ by one point: $m_i$. $V_i$ is the Voronoi region of code vector 
$m_i$ and $d$ is some distance measure. If the Euclidean measure is adopted, then 
Voronoi region boundaries turn out to be piecewise linear. In particular, the 
boundary $S_{i,j}$ between two regions $V_i$ and $V_j$ is a piece of the hyperplane 
equidistant from $m_i$ and $m_j$ (see Figure 1(a)). In the following we will always 
refer to nearest neighbor VQs with Euclidean distance. 
The VQ is extended with a further mapping $\Lambda : \mathcal{M} \to \mathcal{C}$, 
which assigns a label from $\mathcal{C}$ to each code vector. We will call this 
extended VQ a Labeled VQ (LVQ). An LVQ can be used to define a classification 
rule: let $l_i = \Lambda(m_i) \in \mathcal{C}$ denote the label assigned to $m_i$. 
The decision taken by the LVQ when $x$ is presented at its input is 

$$\Phi_{LVQ}(x) = \Lambda \circ \Omega(x) = l_i, \quad \text{if } x \in V_i . \tag{4}$$ 

In practice, the classification is performed by finding in $\mathcal{M}$ the code 
vector at minimum distance from $x$, and then by declaring its label. Thus an LVQ 
implements a simple nearest neighbor rule, and each point of a Voronoi region 
$V_i$ is implicitly labeled with the same label as $m_i$. Figure 1(b) shows an 
example of a labeled partition induced by an LVQ. Notice that, even if each Voronoi region 





Fig. 1. (a) A nearest neighbor VQ of size 8 in $\mathcal{R}^2$ and (b) a possible 
labeled partition induced by labeling the VQ. 



is convex, we can construct non-convex and non-connected decision regions as 
well. Notice also that boundaries between two Voronoi regions with the same 
label (the dashed lines in Figure 1(b)) do not contribute to the definition of 
decision region boundaries (decision boundaries). The adoption of the simple 
Euclidean distance limits us to piecewise linear decision boundaries. However, 
with other distance measures, non-linear boundaries could be obtained as well 
[?, §10.4]. 
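
The nearest neighbor rule implemented by an LVQ can be sketched in a few lines 
of Python. The sketch assumes Euclidean distance and represents the code 
vectors and their labels as two parallel lists; the function name 
`lvq_classify` and the example values are hypothetical. 

```python
import math


def lvq_classify(x, code_vectors, labels):
    """Return the label of the code vector nearest to x (Euclidean distance)."""
    nearest = min(range(len(code_vectors)),
                  key=lambda i: math.dist(x, code_vectors[i]))
    return labels[nearest]


# Hypothetical LVQ of size three in the plane: two white and one black code vector.
code_vectors = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
labels = ["W", "W", "B"]
```

Two code vectors sharing the same label, as in this example, jointly define a 
single decision region made of two Voronoi regions. 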

In order to develop an algorithm to find an optimal approximation of the 
Bayes partition, a crucial observation is that the average risk of an LVQ depends 
only on the labeling function $\Lambda$ and on the mutual position of code vectors 
$m_i$, which determines the form of the integration regions. Thus, keeping the 
labeling function $\Lambda$ fixed, and under the continuity hypothesis for cpdfs, 
average risk is differentiable w.r.t. $\mathcal{M}$. The gradient of 
$R(\mathcal{M})$ w.r.t. the generic $m_i$ has been derived for the first time in 
[?], and has the form 

$$\nabla_i R(\mathcal{M}) = \sum_{j=1}^{C} \sum_{\substack{q=1 \\ q \neq i}}^{M} \frac{b(c_j, l_q) - b(c_j, l_i)}{\| m_i - m_q \|}\, P_c(c_j) \int_{S_{i,q}} (m_i - x)\, p_{x|c}(x|c_j)\, dS_x , \tag{5}$$ 

where $dS_x$ denotes the differential surface in the $x$ space. 
Almost surprisingly, the variation of the risk w.r.t. code vector $m_i$ depends 
only on what happens on boundary surfaces between $V_i$ and each neighbor 
region $V_q$ (obviously $S_{i,q}$ vanishes if $m_i$ and $m_q$ are not neighbors), 
and only in the case that $b(c_j, l_q) \neq b(c_j, l_i)$. In the case of error 
probability this means simply that $l_i \neq l_q$, i.e. that the boundary surface 
between $V_i$ and $V_q$ actually represents a part of the decision boundary. This 
result formalizes and generalizes the original intuition of Hart [?], elaborated 
also in the more recent so-called “boundary hunting” methods [?], that all the 
relevant information about a classification problem is found in those samples 
falling near the decision border. 

The use of a stochastic Parzen estimate for $p_{x|c}(x|c_j)$, and some 
approximations introduced for the sake of simplicity, for which we refer 
interested readers to [?], leads to a class of stochastic gradient algorithms for 
the minimization of $R(\mathcal{M})$. We consider the algorithm, here called Bayes 
VQ (BVQ), obtained when a uniform window of side $\Delta$ is adopted as the Parzen 
window. 

Let us assume that the labeled training set 
$\mathcal{L} = \{(t_1, u_1), \ldots, (t_T, u_T)\}$ is given, where 
$t_i \in \mathcal{R}^n$ is the feature vector and $u_i \in \mathcal{C}$ is its 
class. At each iteration, the algorithm considers a labeled sample randomly 
picked from the training set. If the sample turns out to fall near the decision 
boundary, then the position of the two code vectors determining the boundary is 
updated, moving the code vector with the same label as the sample towards the 
sample itself, and moving away the one with a different label. More precisely, 
the $k$-th iteration of the BVQ algorithm is: 



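BVQ Algorithm - k-th iteration 

1. randomly pick a training pair $(t^{(k)}, u^{(k)})$ from $\mathcal{L}$; 

2. find the two code vectors $m_i^{(k)}$ and $m_j^{(k)}$ nearest to $t^{(k)}$; 
   /* note: certainly such vectors are neighbors! */ 

3. $m_q^{(k+1)} = m_q^{(k)}$ for $q \neq i, j$; 

4. compute $t_s^{(k)}$, the projection of $t^{(k)}$ on $S_{i,j}^{(k)}$; 

5. if $\| t^{(k)} - t_s^{(k)} \| < \Delta/2$ then 

   $$m_i^{(k+1)} = m_i^{(k)} - \gamma^{(k)}\, \frac{b(u^{(k)}, l_j) - b(u^{(k)}, l_i)}{\| m_i^{(k)} - m_j^{(k)} \|}\, (m_i^{(k)} - t_s^{(k)})$$ 

   $$m_j^{(k+1)} = m_j^{(k)} + \gamma^{(k)}\, \frac{b(u^{(k)}, l_j) - b(u^{(k)}, l_i)}{\| m_i^{(k)} - m_j^{(k)} \|}\, (m_j^{(k)} - t_s^{(k)})$$ 

   else $m_q^{(k+1)} = m_q^{(k)}$ for $q = i, j$. 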
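
A single BVQ iteration can be sketched in Python as follows. This is an 
illustrative reading of the steps above under stated assumptions, not the 
authors' code: Euclidean distance is used, the projection is taken on the 
hyperplane equidistant from the two nearest code vectors, and the names 
`bvq_step`, `cost`, `delta` and `gamma` are hypothetical stand-ins for the 
update rule, the cost function b, the window side and the step size. 

```python
import math


def bvq_step(code_vectors, labels, sample, cost, delta, gamma):
    """One BVQ iteration on a labeled sample (t, u); updates code_vectors in place.

    Returns True if the sample fell inside the window around the boundary.
    """
    t, u = sample
    # Indices of the two code vectors nearest to t (nearest first).
    i, j = sorted(range(len(code_vectors)),
                  key=lambda q: math.dist(t, code_vectors[q]))[:2]
    m_i, m_j = code_vectors[i], code_vectors[j]
    # Project t on the hyperplane equidistant from m_i and m_j.
    w = [a - b for a, b in zip(m_i, m_j)]
    norm_w = math.sqrt(sum(c * c for c in w))
    mid = [(a + b) / 2 for a, b in zip(m_i, m_j)]
    offset = sum((tc - mc) * wc for tc, mc, wc in zip(t, mid, w)) / norm_w ** 2
    t_s = [tc - offset * wc for tc, wc in zip(t, w)]
    # Samples farther than delta / 2 from the boundary leave everything unchanged.
    if math.dist(t, t_s) >= delta / 2:
        return False
    factor = gamma * (cost(u, labels[j]) - cost(u, labels[i])) / norm_w
    code_vectors[i] = [m - factor * (m - s) for m, s in zip(m_i, t_s)]
    code_vectors[j] = [m + factor * (m - s) for m, s in zip(m_j, t_s)]
    return True


def zero_one(u, l):
    """Cost of assigning label l to a sample of class u (error probability case)."""
    return 0 if u == l else 1
```

When the two nearest code vectors carry the same label, the cost difference is 
zero and the update leaves both vectors where they are, which matches the 
observation that only decision boundaries matter. 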

^ The projection computed in step 4 of the BVQ algorithm is a function of the two 
code vectors and of $t^{(k)}$ only. 

Figure 2 illustrates the behavior of BVQ, considering two equiprobable classes 
(called black (B) and white (W) class) and error probability as the performance 
measure. In this case, the point P, located where the cpdfs coincide, is, by 
definition, the optimal Bayes decision boundary. Figure 2(a) shows a set of 
samples of the W and B classes, represented by small white and black dots 
respectively, whose distribution in the feature space follows class statistics. 
Figure 2(b) depicts an LVQ of size three, with two code vectors labeled as white 
and one code vector labeled as black. Voronoi region boundaries are the midpoints 
$S_{1,2} = (m_1 + m_2)/2$ and $S_{2,3} = (m_2 + m_3)/2$. The decision regions 
induced by the LVQ are graphically represented by the dashed and solid lines. 
Samples falling near $S_{1,2}$ are not used 




Fig. 2. A graphical representation of the adaptation step performed by BVQ. 



to update code vectors. This is consistent with the fact that perturbations of 
$S_{1,2}$ do not change the decisions taken in its neighborhood, which are always 
in favor of the white class (note however that the updating of $m_2$ and $m_3$ 
indirectly influences $S_{1,2}$ as well). Samples around $S_{2,3}$ are used only 
if they fall inside the window of side $\Delta$, represented in the figure by the 
shaded area. If such a sample is of the B class, then $b(u^{(k)}, l_3) = 0$ and 
$b(u^{(k)}, l_2) = 1$, so $b(u^{(k)}, l_2) - b(u^{(k)}, l_3) = 1$. Hence $m_3$, 
which, with this setting, plays the role of $m_i$ in the algorithm, is moved 
toward the sample, while $m_2$ is moved away. As a result, both code vectors (and 
$S_{2,3}$ as well) will be moved toward the left. Vice versa, if the sample is of 
the W class, both code vectors will be moved toward the right. Repeated 
iterations of the BVQ algorithm move $S_{2,3}$ from its initial position toward 
the left, since at the beginning black samples are more frequent than white 
samples. This drifting continues until a point is reached where the frequency of 
black and white samples falling inside the window is the same. In the asymptotic 
case, the number of samples is arbitrarily large, and the optimal value of 
$\Delta$ is zero, so this point is exactly the point P where the cpdfs coincide. 
In the finite case, P can only be approached, since $\Delta > 0$. However, the 
more samples we have, the smaller the values of $\Delta$ that can be set and the 
more accurate the approximations of the Bayes decision boundary that can be 
found. The comparison of BVQ with other VQ based approaches, performed in [?], 
highlights the advantages of a learning formally based on average risk 
minimization. 

The computational cost of a single iteration of the BVQ can be divided into: 
the cost for finding the two code vectors nearest to the training vector, the cost 
for calculating the projection of the training vector on the decision border, and 
the cost for updating the code vectors. It is simple to see that the dominating 
factor is the first one, corresponding to a total of $M \cdot n$ multiplications. 
This cost is thus independent of the training set size. This result compares 
favorably with the costs of other learning algorithms used in Data Mining, in 
particular SVM. As to the number of iterations necessary to approach the Bayes 
risk, preliminary experimental results in [?] suggest that it does not depend on 
the training set size either, but only on the initial position of code vectors 
and on the initial value of the step size $\gamma$. 

3 Experimental Evidences 

In order to show the advantages of the BVQ algorithm, we compare it to 1-NN, 
SVM and IB2 on a set of real data taken from the UCI Machine Learning 
repository. Table 1 gives the database size and dimension $n$ of the feature space 
for each experiment. 



Table 1. The data sets from the UCI Repository used in the experiments. 

Data set          Size   Dimension 
Australian         690          14 
Diabetes           768           8 
German            1000          24 
Ionosphere         351          34 
Liver-Disorders    345           6 
Mushroom          8124          22 



Table 2 shows the error probabilities achieved by SVM, 1-NN and IB2, together 
with the size of each classifier in brackets. This size is expressed by the 
number of vectors used in the respective classification rules. Since 1-NN does not 
discard any training sample, the size of this classifier corresponds to the 
training set size. The results are taken from [?]. They correspond to average 
measures obtained by a 10-fold cross validation method. Thus, in order to compare 
the results, the same experimental procedure was adopted for the BVQ. 

Before applying the BVQ, the data are normalized in such a way that each feature 
ranges over the same interval. This is an invertible transformation of the data 
which gives equal importance to each vector component during the learning. The 
problem of local minima suffered by “greedy” methods afflicts BVQ as well. This 
can be alleviated by a proper initialization of code vectors, but in this case 
some domain knowledge should be given. In the experiments the LVQs are 
initialized simply with the first training vectors of each training set. The step 
size $\gamma^{(k)}$ decreases as a function of $j_k$, the number of non-null 
updatings until step $k$, starting from an initial value $\gamma^{(0)}$ 
experimentally determined in the range [0.005, 0.1]. In Table 3, for each data set, 

^ UCI Machine Learning repository: http://www.ics.uci.edu/~mlearn/MLRepository.html 
Table 2. Error probability and number of vectors for SVM, 1-NN and IB2 on the UCI 
repository data sets. 

Data set     SVM              1-NN             IB2 
Australian   0.1537 (203.9)   0.185  (603)     0.2642 (151.5) 
Diabetes     0.2292 (401.7)   0.3048 (691.2)   0.3505 (253.5) 
German       0.249  (487)     0.331  (900)     0.388  (338.7) 
Ionosphere   0.0543 (167.1)   0.1543 (315.9)   0.1314 (54.7) 
Liver        0.3132 (209.7)   0.3765 (310.5)   0.4201 (121.3) 
Mushroom     0.0    (437.3)   0.0    (7311)    0.0041 (25.6) 



the optimal value of the parameter Δ and the best error probabilities achieved 
by BVQ with LVQs of varying size are reported. 



Table 3. Error probability vs. number of code vectors for the BVQ on the UCI 
repository data sets. Values marked with * denote the best error probability for 
each experiment. 

No. code   Australian   Diabetes    German      Ionosphere   Liver       Mushroom 
vectors    Δ = 0.346    Δ = 0.49    Δ = 1.549   Δ = 1.897    Δ = 0.154   Δ = 1.897 
 2         0.1493       0.2396      0.291       0.1454       0.3338      0.2254 
 4         0.1478       0.2358      0.290       0.1398       0.3219      0.2129 
 8         0.1449       0.2265*     0.277       0.1343       0.286*      0.100 
16         0.1493       0.2358      0.270       0.1231       0.3038      0.0242 
32         0.1435*      0.2422      0.253*      0.112*       0.3033      0.0138* 



Some comments on this table are in order. First, we have to report a 
phenomenon typical of VQ architectures, called the “dead neuron problem” by 
Kohonen. In practice, it can happen that a code vector is never used to classify 
input data. The removal of such code vectors would not modify the error 
probability for the data at hand. Thus the number of code vectors reported is always 
greater than or equal to the true number of used code vectors. Second, we can 
observe a non-monotonic trend of error probability vs. number of code vectors for 
the Australian, Diabetes and Liver data sets. This somewhat counter-intuitive 
result can be explained by the intrinsic “jitter” of stochastic methods, which makes 
the system wander around the optimum, by the dead neuron problem, and by 
the dependency of the result on the initialization and labeling of code vectors. 

Turning to the comparison of the results in Tables 2 and 3, we can observe that 
the BVQ is always the worst method on the mushroom experiment. This bad performance 
can be explained by noticing that, in this experiment, samples are described 
by purely categorical features, hence cpdfs are not piecewise continuous, and 
the basic assumption for the applicability of the method is not satisfied. As 
a consequence, we observe a large number of iterations needed to converge, 
and a great sensitivity of the algorithm to the initialization of code vectors and 
to the value of the parameter Δ. Conversely, the algorithm proves to work 
nicely if at least some of the features are continuous, as is the case for 
the other experiments. Here, the BVQ clearly outperforms both 1-NN and IB2, 
while its performance can be considered at least comparable to that of SVM. In 
particular, BVQ performance is about 7%, 1% and 10% better than SVM 
on the Australian, Diabetes and Liver data sets respectively, while it is about 
2% and 50% worse than SVM on the German and Ionosphere data sets 
respectively. For the Ionosphere experiment, we failed to find a number of code vectors for 
which a comparable error probability could be obtained. This fact is likely to be 
related to the small training set size, especially compared with the dimension 
of the feature space. In this case, the large value of the window introduces a 
distortion in the estimate of o and of the cost function, and the BVQ can find 
only a poor suboptimum. It is nevertheless better than the one found by 1-NN 
and IB2. Since the latter result is likely to be due to the small training set size, 
it is not very serious, in the perspective of very large databases. 
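
The percentages quoted above can be checked against Tables 2 and 3. The sketch below is only an illustration: it assumes that each figure is the difference between the SVM error and the best BVQ error, divided by the BVQ error. That normalisation is not stated in the text; it is chosen here because it gives values close to the rounded figures. The names `svm`, `bvq` and `relative_gain` are hypothetical.

```python
# Error probabilities copied from Table 2 (SVM) and Table 3 (best BVQ value).
svm = {"Australian": 0.1537, "Diabetes": 0.2292, "German": 0.249,
       "Ionosphere": 0.0543, "Liver": 0.3132}
bvq = {"Australian": 0.1435, "Diabetes": 0.2265, "German": 0.253,
       "Ionosphere": 0.112, "Liver": 0.286}


def relative_gain(svm_err, bvq_err):
    """Positive when BVQ is better than SVM.

    Assumption: the gap is normalised by the BVQ error probability.
    """
    return (svm_err - bvq_err) / bvq_err


gains = {name: relative_gain(svm[name], bvq[name]) for name in svm}
for name, gain in gains.items():
    print(f"{name:11s} {100 * gain:+6.1f}%")
```

Under this assumption the gains are roughly +7%, +1% and +10% for Australian, Diabetes and Liver, and roughly -2% and -50% for German and Ionosphere.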

On the other hand, the advantage of BVQ in memory requirements is striking. 
As we can see, two code vectors are sufficient for BVQ to obtain a lower error 
probability than all the other methods on the Australian data set; they are also 
sufficient to reach a lower error probability than 1-NN in all the experiments 
with continuous features, and a lower error probability than IB2 in all of them 
but the Ionosphere experiment. This fact, together with the fact 
that, using 2 code vectors, BVQ already achieves almost as good results as with 
32 code vectors, gives some insight into these problems, suggesting the hypothesis 
that the optimal decision boundary can be quite accurately approximated by a 
linear decision boundary. 

The highest memory requirements for the BVQ are on the German data set, 
where 32 code vectors are needed to reach an error probability comparable to 
that of SVM, which needs 487 vectors out of 900 training samples. 

These results assume greater relevance in the perspective of very large 
databases, in light of the fact that, with BVQ, the classifier size turns out to be 
related only to the geometry of the problem, that is, to the number of classes and 
the shape of the decision boundaries, while with the other methods the classifier size 
grows with the training set size. Preliminary results supporting this statement 
can be found in [2]. 

The sizes of the BVQ, IB2 and 1-NN classifiers also allow a direct comparison of 
their computational cost, as they all use the same nearest neighbor decision 
function. For instance, on the German data set, the BVQ classifies a sample by 
calculating only 2 distances, against the 900 distance calculations of 1-NN and the 
339 distance calculations of IB2. The decision function of SVM takes a different 
form. SVM can directly manage only two-class problems². Assuming that class 
labels are encoded by the integers 1 and −1, the SVM decision function is 

$$\psi_{SVM}(x) = \mathrm{sgn}\Big[\sum_{i=1}^{S} u_i \alpha_i K(x \cdot s_i) + b\Big], \qquad (6)$$

where $\{(s_1, u_1), \ldots, (s_S, u_S)\} \subset \mathcal{L}$ is the set of support vectors and $S$ is the size 
of the classifier. $K(\cdot)$ is either a linear or a nonlinear kernel (typical kernels are 
the Gaussian and sigmoid functions). If $K$ is a linear kernel, then this decision 
function requires approximately the same number of multiplications as the LVQ 
decision function. In fact, we can rearrange the latter as follows: 

$$\psi_{LVQ}(x) = \Lambda\big(\arg\min_{m_i} \{\|m_i - x\|^2\}\big) = \Lambda\big(\arg\min_{m_i} \{\|m_i\|^2 - 2\, x \cdot m_i\}\big).$$

² Problems with C > 2 classes have to be managed by designing C different classifiers, 
each separating one class from the rest. 

Hence, if we store the squared norms of the code vectors, which costs the same as 
storing the weights in SVM, both decision functions require a number of 
inner products equal to the size of the classifier. In the general case, however, the 
SVM decision function is computationally heavier than the LVQ one, since we have to add 
the cost of evaluating the nonlinear kernel. In the above experiments, in order to 
obtain the reported accuracy, the adopted kernel is always Gaussian, except for 
the mushroom experiment where it is linear. Thus, the comparison of the sizes 
of the BVQ and SVM classifiers establishes an even greater computational 
advantage of the former over the latter than the advantage observed over 1-NN 
and IB2. 
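
The rearrangement of the nearest neighbor rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the score uses the full expansion of the squared distance with the constant term for x dropped. It shows that, once the squared norms are stored, one inner product per code vector is enough to pick the winner.

```python
def precompute_sq_norms(code_vectors):
    """Squared norm of every code vector, computed once after training."""
    return [sum(c * c for c in m) for m in code_vectors]


def classify_lvq(x, code_vectors, labels, sq_norms):
    """Label of the nearest code vector, using one inner product per vector.

    ||m - x||^2 = ||m||^2 - 2 x.m + ||x||^2, and ||x||^2 is the same for
    every code vector, so it can be dropped from the comparison.
    """
    def score(i):
        inner = sum(a * b for a, b in zip(x, code_vectors[i]))
        return sq_norms[i] - 2.0 * inner

    best = min(range(len(code_vectors)), key=score)
    return labels[best]


def classify_by_distance(x, code_vectors, labels):
    """Reference version that computes the squared distances directly."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(x, code_vectors[i]))

    return labels[min(range(len(code_vectors)), key=sq_dist)]


# Toy two-class quantizer with two code vectors (illustrative values).
code_vectors = [(0.0, 0.0), (1.0, 1.0)]
labels = [-1, 1]
sq_norms = precompute_sq_norms(code_vectors)
print(classify_lvq((0.9, 0.6), code_vectors, labels, sq_norms))
```

Both functions return the same label for any input; the first one simply moves the norm computation out of the classification step.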

4 Conclusions 

In this paper we presented a data compression approach to the classification 
task, based on the stochastic gradient algorithm BVQ. The main properties of 
the approach can be summarized in terms of the efficiency of the learning 
process, and the efficiency and accuracy of the classification process. Efficiency of the 
learning process is due both to the use of a stochastic gradient algorithm, which 
exploits only one training sample per iteration, and to the light computational 
cost of each iteration. Efficiency of the classification process is due to the use 
of nearest neighbor vector quantizer architectures, which implement 
a simple nearest neighbor rule based on a very small number of elements with 
respect to the training set size. Finally, both the efficiency and the effectiveness of the 
classification process gain from the use of the average misclassification risk as 
a learning criterion, which allows the VQ to be designed in such a way that the 
(locally) optimal linear approximation of the Bayes decision border is found with 
the given number of code vectors. Furthermore, although in the experiments we 
focus on error probability as the performance measure, the BVQ is a general 
algorithm for the minimization of the average risk. Thus the introduction of 
misclassification cost matrices in the formulation of the classification problem 
can be supported as well. This feature is important for practical applications, 
where some classification errors are often considered more serious than others 
(for instance, judging a client reliable for a loan when they are unreliable can be 
more dangerous for a bank than judging a reliable client unreliable). The use 
of general cost matrices in real applications will be the subject of future research. 
Other directions of research include the study of techniques to improve BVQ 
performance, by finding better initialization strategies for the code vectors and by 
developing non-greedy versions of the algorithm to escape local minima. Also of 
interest is the exploitation of geometric characteristics of VQs in 
order to extract symbolic classification rules from them. 
Abstract. Supervised classification involves many heuristics on which 
induction algorithms are based, including the ideas of decision trees, 
k-nearest neighbour (k-NN), pattern frequency, neural networks, and 
Bayesian rules. In this paper, we propose a new instance-based induction 
algorithm which combines the strengths of pattern frequency and distance. 
We define a neighbourhood of a test instance. If the neighbourhood contains 
training data, we use k-NN to make decisions. Otherwise, we examine the 
support (frequency) of certain types of subsets of the test instance, and 
calculate support summations for prediction. This scheme is intended 
to deal with outliers: when no training data is near a test instance, 
the distance measure is not a proper predictor for classification. 
We present an effective method to choose an “optimal” neighbourhood 
factor for a given data set, using part of the training data as guidance. 
In this work, we find that our algorithm maintains (and sometimes 
exceeds) the outstanding accuracy of k-NN on data sets containing 
purely continuous attributes, and that it greatly improves the 
accuracy of k-NN on data sets containing a mixture of continuous and 
categorical attributes. In general, our method is much superior to C5.0. 

Keywords: classification, neighbourhood, emerging patterns, outlier. 



1 Introduction 

Supervised classification, where prediction is performed after training instances 
are provided, has been intensively studied in the machine learning and pattern 
recognition communities over a long period of time. Instance-based induction, 
in contrast to eager-learning-based classification (as exemplified by C4.5 [17]), 
is an important approach to classification. A typical example of an instance-based 
induction algorithm is the k-nearest neighbour (k-NN) rule. Given a test 
instance T and a database of training instances, the k-NN rule finds the k training 
instances which are nearest to T according to some kind of distance, and 
chooses the class label prevailing among these training instances. 

Recently, with the advances in data mining, supervised classification has also 
become an interesting topic in the KDD field. Liu et al. have proposed the 
CBA classifier based on the idea of association rules. Meretakis and Wuthrich 
have explored the use of frequent and long patterns in optimising posterior 
probabilities which are used for classification. Dong, Li, Ramamohanarao et al. 
have proposed the CAEP, JEP-C, and DeEPs classifiers based on 
the concept of emerging patterns. A common aspect of the above classifiers 
is that they use the support (frequency) of interesting patterns as the 
basis of their discriminating power, rather than the distance used in the 
classic k-NN rule. In this paper, we investigate how to combine the strengths of 
pattern frequency and distance to solve supervised classification tasks. 

Suppose an instance T is to be classified. The basic idea of the approach 
proposed in this paper is to utilize distance as a measure to predict the class 
of T when a local area of T contains some training instances; otherwise (when 
there exists no training instance in the local area) to make use of the support of 
some subsets of T. The constrained use of distance reflects our belief that 

— when a test instance is an outlier, i.e., its local area contains no training 
instance, the k-NN rule may not properly predict the class of the outlier. 

In this case, 

— we examine the significant support change between classes of some subsets 
of the outlier, and summarize those changes to make a decision. 

The idea of compactly aggregating support changes for classification 
originated in our previous work on DeEPs, an instance-based classifier. In brief, the 
idea behind DeEPs is to discover those subsets of a given test instance whose 
support changes significantly from one class to another, and then base decisions 
on the supports of the discovered subsets. In this paper, we improve the support 
aggregation method of DeEPs. As a result, the improved DeEPs can properly 
handle data sets with a very unbalanced class distribution. 

Without any prior knowledge, it is difficult to define an outlier, that is, to define a 
neighbourhood of a point within which there are no other points. We propose here a 
method to determine an appropriate neighbourhood of a test instance by using 
part of the training data as a guide. Basically, we initially set three neighbourhoods, 
and we choose the one which is the best (in terms of accuracy) for the partial 
training data and apply it to all test instances. 

The remainder of the paper is organized as follows. Section 2 begins with a 
set of basic term definitions, followed by a brief description of k-NN and DeEPs, 
which are closely related to the current work. Our main contributions are also 
described in this section. Section 3 details our methods, including the selection of 
a neighbourhood, the summation of pattern supports, and the combination of k-NN 
and DeEPs. Section 4 reports our experimental results on 30 widely used data 
sets. The results show that the proposed approach generally outperforms 
k-NN and C5.0, a commercial version of C4.5 [17]. Section 5 
concludes this paper. 

2 Related Work and Our Contributions 

In this section we provide basic definitions and background material. We also 
describe related work, including k-NN and our previous work DeEPs, each 
followed by the new contributions we make to it. These two instance-based 
classifiers are combined into a new approach, which we present in 
Section 3. 



2.1 Background 

An attribute is one of the most elementary terms in relational (including 
binary transactional) databases. For example, Shape can be an attribute in some 
databases. Usually, there exist at least two values for an attribute. The Shape 
attribute can have such categorical values as square, circle, and diamond. 
Other attributes take continuous values. For example, 
Height can have continuous values ranging from 0 to 3 meters. 

Item is an important notion in data mining. An item is defined as an 
attribute-value pair (e.g., Shape-square) for a categorical attribute, or an 
attribute-interval pair (e.g., Height-[0.1, 1.2)) for a continuous attribute. An 
itemset is a set of items. The process of partitioning the value range of a 
continuous attribute into a number of intervals is referred to as discretization. 
Many classification algorithms cannot be applied to real-world classification 
tasks unless the continuous attributes are first discretized. However, our 
methods do not need to pre-discretize data. This is one of our advantages over 
other methods. 

An instance is also defined as a set of items. Usually, different instances 
within a data set have the same number of attributes. When an instance is 
labelled with a class, the instance is called a training instance. When an in- 
stance’s class is unknown or is assumed unknown, then it is referred to as a test 
instance. The test accuracy of a classifier is the percentage of test instances 
which are correctly classified. 

The support of an itemset measures the occurrence (or frequency) 
of the itemset in a data set. Given a database D and an itemset X, the support 
of X in D, denoted supp_D(X), is the percentage of instances in D containing X. 
An itemset X is contained in (or occurs in) an instance Y if X ⊆ Y. 
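
The definition of support translates directly into code. The following is a minimal sketch: instances and itemsets are represented as Python sets of (attribute, value) pairs, and both the function name `support` and the toy database are illustrative assumptions.

```python
def support(itemset, database):
    """Percentage of instances in `database` that contain `itemset`."""
    itemset = frozenset(itemset)
    if not database:
        return 0.0
    # X is contained in Y exactly when X is a subset of Y.
    hits = sum(1 for instance in database if itemset <= instance)
    return 100.0 * hits / len(database)


# Toy database of four instances (illustrative values only).
database = [
    frozenset({("Shape", "square"), ("Colour", "red")}),
    frozenset({("Shape", "square"), ("Colour", "blue")}),
    frozenset({("Shape", "circle"), ("Colour", "red")}),
    frozenset({("Shape", "diamond"), ("Colour", "red")}),
]
print(support({("Shape", "square")}, database))
```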



2.2 The k-Nearest Neighbour Rule 

With extensive theoretical analysis and rigorous empirical evaluation, the k-NN 
rule has attracted the attention of many researchers since its conception 
in 1951. The idea of k-NN is straightforward. Given a test instance and 
for k = 1, one finds the stored training instance which is nearest, notes the class 
of the retrieved case, and predicts that the new instance will have the same class. 
In spite of its simplicity, k-NN is powerful in solving those classification 
tasks where all attributes are continuous. For example, on the letter-recognition 
data set¹ the k-NN rule reaches a test accuracy of 95.58%; in comparison, the 
notable decision-tree-based classifier C5.0 obtains a test accuracy of only 88.06%, 
about 7.5 percentage points lower than k-NN. However, on data sets which contain a 
mixture of categorical and continuous attributes, k-NN loses its power. For 
example, on the australian data set, k-NN and C5.0 reach 66.69% and 85.94% 
accuracy respectively. The degradation in accuracy of k-NN on those data sets 
is mainly caused by the confusing distance contributions of categorical and 
continuous attributes. So, one of the extensively studied issues for k-NN is how 
to balance the distance contributions of the two types of attributes. 

¹ All data sets used in this paper were taken from the UCI Machine Learning 
Repository. 

In this paper, our proposed method can appropriately deal with both continuous 
and categorical attributes. When continuous attributes are present in a data 
set, we scale the training values of every continuous attribute into the range [0, 1], 
and use the same parameters to scale their values in test instances. When 
categorical attributes are present, we transform the categorical values into 
continuous values: the value of any categorical attribute of a test instance is 
transformed into 1, and the values in training instances are transformed into 1 
if they are the same as the test value and into 0 if they are different. We then 
calculate the Euclidean distance. The experimental results show that our method 
maintains the outstanding accuracy of k-NN on data sets containing purely 
continuous attributes, and that it greatly improves the accuracy of k-NN, 
particularly on data sets containing a mixture of the two types of attributes. 
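
The distance just described can be sketched as follows. This is a minimal illustration under stated assumptions: rows are tuples, `continuous_idx` lists the positions of the continuous attributes, and the helper names (`fit_scaler`, `scale`, `mixed_distance`) as well as the handling of a constant attribute are hypothetical choices, not taken from the paper.

```python
import math


def fit_scaler(train_rows, continuous_idx):
    """Store (min, max) of every continuous attribute in the training data."""
    return {j: (min(r[j] for r in train_rows), max(r[j] for r in train_rows))
            for j in continuous_idx}


def scale(value, lo, hi):
    """Min-max scaling; a constant attribute is mapped to 0 (assumption)."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def mixed_distance(test_row, train_row, scaler):
    """Euclidean distance after the transformation described in the text."""
    total = 0.0
    for j, (t, r) in enumerate(zip(test_row, train_row)):
        if j in scaler:
            lo, hi = scaler[j]
            diff = scale(t, lo, hi) - scale(r, lo, hi)
        else:
            # Categorical: test value -> 1, training value -> 1 if equal, else 0.
            diff = 1.0 - (1.0 if r == t else 0.0)
        total += diff * diff
    return math.sqrt(total)


# Illustrative data: one categorical and two continuous attributes.
train = [("circle", 0.0, 10.0), ("square", 1.0, 20.0), ("square", 0.5, 30.0)]
scaler = fit_scaler(train, continuous_idx=[1, 2])
test_row = ("square", 0.5, 20.0)
print([round(mixed_distance(test_row, row, scaler), 4) for row in train])
```

With this encoding a categorical mismatch contributes exactly 1 to the squared distance, the same as the largest possible gap on a scaled continuous attribute.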

2.3 Brief Description of DeEPs 

DeEPs is a recently proposed instance-based classifier. It makes decisions 
based on the supports of certain types of subsets of a test instance. Assume we 
are given a set D_p of positive instances, a set D_n of negative instances, 
and a test instance T. DeEPs selects two special collections of subsets of T: one 
consists of subsets which occur in D_p but not in D_n; the other of those which 
occur in D_n but not in D_p. Then, DeEPs calculates a support summation over 
each of the two collections. DeEPs assigns T the class for which the larger summation 
is obtained. 

There is a series of data reduction and concise knowledge representation 
techniques used in DeEPs. A significant reduction is achieved by removing from 
the training data the values irrelevant to a test instance: if an attribute is 
categorical, DeEPs removes those values of this attribute in the training data which are 
different from the value of the test instance; if an attribute is continuous, DeEPs 
removes those values which are beyond a neighbourhood of the value of the test 
instance. Table 1 demonstrates these points. 
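
The reduction step can be reproduced on the small example of Table 1. The sketch below is illustrative only: the function name `reduce_to_binary` and the use of a single half-width `alpha` for every continuous attribute are assumptions, chosen so that the neighbourhoods match those given in the table caption ([0.28, 0.32] and [0.23, 0.27]).

```python
def reduce_to_binary(train_rows, test_row, continuous_idx, alpha):
    """Mark with 1 the training values relevant to the test instance.

    A categorical value is kept when it equals the test value; a continuous
    value is kept when it lies within `alpha` of the test value.
    """
    binary = []
    for row in train_rows:
        bits = []
        for j, (value, t) in enumerate(zip(row, test_row)):
            if j in continuous_idx:
                keep = t - alpha <= value <= t + alpha
            else:
                keep = value == t
            bits.append(1 if keep else 0)
        binary.append(bits)
    return binary


# Training data and test instance from Table 1.
train_rows = [("circle", 0.31, 0.24),
              ("square", 0.80, 0.70),
              ("diamond", 0.48, 0.12)]
test_row = ("square", 0.30, 0.25)
print(reduce_to_binary(train_rows, test_row, continuous_idx={1, 2}, alpha=0.02))
```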

In this paper, we improve DeEPs by proposing a new method to summarize 
the supports. This new method is specially designed to handle those data sets 
containing extremely unbalanced class distributions. For example, in the lym 
data set, the class distribution is: 2 instances for class 1, 81 for class 2, 
61 for class 3, and 4 for class 4. Let A and B be two subsets of a test instance. 
Suppose A occurs only once in class 1, but is not present in any other class. 
Suppose B occurs 27 times in class 2, but does not occur in any other class. None of 
the other subsets has any occurrence in the four classes. Then, supp_1(A) = 50%, 
and supp_2(B) = 33.3%. Therefore, the original DeEPs would choose class 1. 
Table 1. Original training data are transformed into binary data after removing values 
which are irrelevant to the test instance T = {square, 0.30, 0.25}. A chosen 
neighbourhood of 0.30 is [0.28, 0.32] and a chosen neighbourhood of 0.25 is [0.23, 0.27]. 

original training data       binary data 
circle    0.31   0.24        0   1   1 
square    0.80   0.70        1   0   0 
diamond   0.48   0.12        0   0   0 

T = {square, 0.30, 0.25} 



Apparently, this decision is unreasonable, as the occurrence of B in class 2 is 
much more frequent than A’s occurrence in class 1. In this work, we add a 
weight to scale down the supports contributed by small classes. This adaptation 
is detailed in Section 3.3. 

As mentioned in the Introduction, a more important contribution of this 
paper is to combine the use of pattern frequency and distance to strengthen 
a classifier’s discriminating power. We discuss when and how the system uses 
the k-NN rule and when it uses the DeEPs ideas. These proposals are presented 
in the following section. 

3 Our Methods 

An important preliminary step of our method is to normalise all training values of 
every continuous attribute into the range [0, 1]. For each continuous attribute, 
we used the formula $\frac{x - min}{max - min}$, where x is an original training 
value, and max and min are the maximum and minimum values respectively in the 
training data. The normalisation parameters max and min are stored and will be 
used to scale the values in any test instances. 

3.1 Main Steps 

Suppose we are given a classification problem in which a data set D contains 
C (C ≥ 2) classes of data. Our method consists of the following two main 
steps when a test instance T is to be classified. 

(a) If a chosen neighbourhood of T covers some training instances, we apply 
3-NN to classify T. (In the special situations where only one or two instances are 
covered, we apply 1-NN.) 

(b) Otherwise, when the neighbourhood does not contain any training instances, 
we apply DeEPs. Note that we consider T as an outlier in this case. 

These basic ideas are illustrated in Figure 1. 

In the subsequent two subsections, we describe our methods to select proper 
neighbourhoods, and to improve DeEPs’ support summation process. 
Fig. 1. Our classification algorithm deals with two cases when a test instance T is 
to be classified. The “+” signs and the other signs represent instances from two 
different classes. 



3.2 Selecting a Proper Neighbourhood 

We first give a definition of a neighbourhood of $T = \{a_1, a_2, \ldots, a_n\}$. 

Definition 1. An α-neighbourhood of T is defined as the area: 

$$\{X \mid a_i - \alpha \le x_i \le a_i + \alpha \text{ if } a_i \text{ is continuous, and } x_j = a_j \text{ if } a_j \text{ is categorical}\},$$

where $X = \{x_1, x_2, \ldots, x_n\}$, $1 \le i \le n$, $1 \le j \le n$. 
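
Definition 1 and the two main steps of Section 3.1 can be combined into a short dispatch routine. This is a sketch under stated assumptions, not the authors' implementation: `distance` and `fallback` are hypothetical callables (the latter stands in for the DeEPs classifier), and the nearest neighbours are searched over the whole training set, which is one possible reading of step (a).

```python
from collections import Counter


def in_neighbourhood(x, t, continuous_idx, alpha):
    """True if training instance x lies in the alpha-neighbourhood of t."""
    for j, (xj, tj) in enumerate(zip(x, t)):
        if j in continuous_idx:
            if abs(xj - tj) > alpha:
                return False
        elif xj != tj:
            return False
    return True


def classify_instance(t, train, continuous_idx, alpha, distance, fallback):
    """train is a list of (row, label) pairs."""
    covered = [row for row, _ in train
               if in_neighbourhood(row, t, continuous_idx, alpha)]
    if not covered:
        # Step (b): t is treated as an outlier and handed to the fallback.
        return fallback(t, train)
    # Step (a): 3-NN, or 1-NN when only one or two instances are covered.
    k = 3 if len(covered) >= 3 else 1
    nearest = sorted(train, key=lambda pair: distance(t, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Illustrative one-dimensional data, already scaled to [0, 1].
train = [((0.10,), "a"), ((0.12,), "a"), ((0.14,), "b"), ((0.90,), "b")]
dist = lambda u, v: abs(u[0] - v[0])
outlier = lambda t, data: "outlier"
print(classify_instance((0.11,), train, {0}, 0.05, dist, outlier))
```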



Depending on the properties and complexity of the training data, 
the parameter α should not be uniform across data sets. We propose 
here a heuristic to optimize α for a given data set. Recall that one of the primary 
objectives of classification algorithms is to accurately predict the class of test 
instances. The proposed heuristic is closely related to this objective. 

We initially set three values for α: 0.05, 0.10, and 0.20. We then randomly 
choose 10% of the training data and view them as “test” instances. Three 
accuracies on this special collection of “test” instances can therefore be obtained 
by using our classification method (as described in Section 3.1). Then, we apply the 
value of α by which the highest accuracy is obtained to the real test instances. 
This heuristic emphasizes the guidance role played by part of the training data. 
Sometimes, such a process is referred to as tuning by training data. 
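
The tuning heuristic can be sketched as a small selection loop. The code below is only an illustration under assumptions: `classify(row, training, alpha)` is a hypothetical callable standing for the classification method, the held-out 10% is removed from the data used to classify it, and ties are resolved in favour of the first candidate. None of these details is specified in the text.

```python
import random


def choose_alpha(train, classify, candidates=(0.05, 0.10, 0.20),
                 fraction=0.10, seed=0):
    """Pick the candidate alpha with the best accuracy on a held-out part.

    train is a list of (row, label) pairs.
    """
    rng = random.Random(seed)
    n_held = max(1, round(fraction * len(train)))
    held_idx = set(rng.sample(range(len(train)), n_held))
    held = [train[i] for i in sorted(held_idx)]
    rest = [train[i] for i in range(len(train)) if i not in held_idx]

    def accuracy(alpha):
        hits = sum(1 for row, label in held
                   if classify(row, rest, alpha) == label)
        return hits / len(held)

    # max() returns the first candidate among those with equal accuracy.
    return max(candidates, key=accuracy)


# Dummy classifier that is only right for alpha = 0.10 (illustrative).
toy_train = [((i / 20.0,), "x") for i in range(20)]
toy_classify = lambda row, rest, alpha: "x" if alpha == 0.10 else "y"
print(choose_alpha(toy_train, toy_classify))
```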



3.3 An Improved Support Summation Method 

In the original DeEPs algorithm, the supports of some subsets of a test 
instance are aggregated by the compact summation method. We review the compact 
summation using the following definition. 

Definition 2. Let D_i be the set of instances in D that belong to class i. The 
compact summation score for class i of a collection S of subsets of a test instance 
T is defined as the percentage of instances in D_i that contain one or more of the 
subsets. That is: 

$$compact\_score(i) = \frac{count_{D_i}(S)}{|D_i|} * bias(i),$$

where bias(i) = 1, and $count_{D_i}(S)$ is the number of instances in D_i which contain 
one or more of the subsets in S. 

The improved summation method refines the formula by multiplying by a bias 
weight (< 1) when the class is small (i.e., D_i is small). We set bias(i) to a fixed 
fraction if the number of instances in D_i is less than 20 and the numbers of 
instances in other classes are at least three times larger than 20. This amendment 
helps DeEPs to adjust the compact scores contributed by unbalanced classes. 

Many other options for the bias are possible. For example, the selection of the 
weight can rely on Bayesian rules. This is one of our future investigations. 
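
The effect of the bias weight can be illustrated on the lym example of Section 2.3. The sketch below rests on two assumptions that are not taken from the paper: the weight value 0.5 is a placeholder, and a class is treated as small when it has fewer than 20 instances while the largest other class has at least 60. The function names are hypothetical.

```python
def bias_for(class_index, class_sizes, small=20, factor=3, weight=0.5):
    """Bias weight for one class; `weight` is a placeholder value."""
    size = class_sizes[class_index]
    others = [s for i, s in enumerate(class_sizes) if i != class_index]
    # Assumed reading: at least one other class is three times larger than 20.
    if size < small and max(others) >= factor * small:
        return weight
    return 1.0


def compact_score(count_in_class, class_size, bias=1.0):
    """Percentage of class instances containing a subset in S, times the bias."""
    return 100.0 * count_in_class / class_size * bias


# lym: 2, 81, 61 and 4 instances; A matches 1 instance of class 1,
# B matches 27 instances of class 2, nothing matches in classes 3 and 4.
class_sizes = [2, 81, 61, 4]
counts = [1, 27, 0, 0]

original = [compact_score(c, n) for c, n in zip(counts, class_sizes)]
improved = [compact_score(c, n, bias_for(i, class_sizes))
            for i, (c, n) in enumerate(zip(counts, class_sizes))]
print(original.index(max(original)) + 1, improved.index(max(improved)) + 1)
```

With the unweighted scores class 1 wins (50% against 33.3%); with the placeholder weight the score of class 1 drops to 25% and class 2 is chosen.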



4 Experimental Results 

In this section we report the performance of our method in comparison with 
that of k-NN and C5.0. We used 30 data sets for experimental 
evaluation. For each dataset-algorithm combination, the test accuracies were 
measured by ten-fold stratified cross-validation. The test instances of each of the 
ten mutually exclusive folds were randomly selected from the original data sets. 
The same splits of the data were used for all three classification algorithms. 
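
One way to obtain such splits is sketched below. It is a generic stratified ten-fold partition with a fixed seed, so that every algorithm can be run on identical folds; the function name `stratified_folds` and the round-robin assignment are assumptions, not the procedure actually used in the experiments.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Partition instance indices into folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members of each class round-robin over the folds.
        for idx in members:
            folds[position % n_folds].append(idx)
            position += 1
    return folds


# Illustrative labels: 60 instances of class "a" and 40 of class "b".
labels = ["a"] * 60 + ["b"] * 40
folds = stratified_folds(labels, n_folds=10, seed=1)
print([len(fold) for fold in folds])
```

Reusing the same `seed` reproduces the same folds, which is what allows a like-for-like comparison between algorithms.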



4.1 Accuracy Comparison 

Table 2 provides the data set names and the number and type of the attributes. 
(See columns 1, 2, and 3.) Columns 4, 5, and 6 show the test accuracies achieved 
by 3-NN, C5.0, and our approach respectively. Vertically, 
Table 2 is organized into three groups according to the performance differences 
between our method and C5.0: significant differences (≥ 2.00%) are in the top 
and bottom groups, while slight differences (< 2.00%) are in the middle. 

We observed the following interesting points from Table 2: 

— Among the 30 data sets, our method is significantly superior to C5.0 on 16 
data sets. The accuracy gaps can reach up to 14.93% (in sonar), half of them 
are around 6.5%. 

— Our method is not always better than C5.0. We lose on four data sets, 
particularly on auto. 

— On average over the 30 data sets, our accuracy is 2.18% higher than C5.0, 
and 7.27% higher than 3-NN. 
Table 2. Accuracy comparison among our algorithm, C5.0, and k-NN. 

Data sets       Numbers of attributes    Accuracy (%)               Difference 
                cont.    categ.          3-NN     C5.0     Ours     ours vs. C5.0 
australian        6        8             66.69    85.94    88.41     +2.47 
cleve             5        8             62.64    77.16    83.18     +6.02 
crx               6        9             66.64    83.91    86.37     +2.46 
german            7       13             63.1     71.3     74.40     +3.10 
heart             6        7             64.07    77.06    81.11     +4.05 
hepatitis         6       13             70.29    74.70    82.56     +7.86 
iris              4        0             96.00    94.00    96.00     +2.00 
letter-recog.    16        0             95.58    88.06    95.51     +7.45 
labor-neg         8        8             93.00    83.99    91.67     +7.68 
lym               3       15             74.79    74.86    84.10     +9.24 
pendigits        16        0             99.35    96.67    98.81     +2.14 
satimage         36        0             91.11    86.74    90.82     +4.08 
sonar            60        0             82.69    70.20    85.13    +14.93 
soybean-s        35        0            100.0     98.00   100.0      +2.00 
waveform         21        0             80.86    76.5     83.78     +7.28 
wine             13        0             72.94    93.35    95.55     +2.20 

anneal            6       32             89.70    93.59    95.11     +1.52 
breast-w         10        0             96.85    95.43    96.28     +0.85 
diabetes          8        0             69.14    73.03    73.17     +0.14 
horse-colic       7       15             66.31    84.81    85.05     +0.24 
hypothyroid       7       18             98.26    99.32    98.17     -1.15 
ionosphere       34        0             83.96    91.92    91.08     -0.84 
pima              8        0             69.14    73.03    73.17     +0.14 
segment          19        0             95.58    97.28    96.62     -0.66 
shuttle-s         9        0             99.54    99.65    99.74     +0.09 
yeast             8        0             54.39    56.14    54.62     -1.52 

auto             15       10             40.86    83.18    74.04     -9.14 
glass             9        0             67.70    70.01    67.98     -2.03 
sick              7       22             93.00    98.78    96.55     -2.23 
vehicle          18        0             65.25    73.68    68.71     -4.97 

Average                                  78.98    84.07    86.25     +2.18 



— Our method maintains (and sometimes exceeds) the outstanding discriminating 
power of k-NN on data sets containing purely continuous attributes. 
Moreover, the accuracy is greatly improved from k-NN to 
our approach when data sets contain a mixture of continuous and categorical 
attributes, such as australian, cleve, crx, and so on. 

The data sets in this paper do not include any which contain purely categorical 
attributes. Assuming no duplicate instances in a data set which contains purely 
categorical attributes, the current classification algorithm is equivalent to 
the original DeEPs, except that the current support summation method is 
refined. Our previous experimental results show that the accuracy of DeEPs is on 
average comparable to C5.0 when handling data sets containing purely categorical 
attributes. The accuracy is sometimes better (e.g., on the tic-tac-toe data set), 
but sometimes worse (e.g., on the splice data set). 

4.2 Accuracy Variation among Folds 

The next set of experimental results demonstrates the accuracy variations 
among the ten folds. We chose the australian, anneal, and sick data sets as 
examples, which come respectively from the three data set groups in Table 2. The 
results are summarized in Table 3, where “Maximum difference” represents the 
difference between the minimum and the maximum accuracy over the ten folds. 



Table 3. Ten-fold accuracy variations in our method, C5.0, and 3-NN. 

Data sets     Algorithms    Accuracies (%)              Standard     Maximum 
                            min      average   max      deviation    difference 
australian    ours          81.16    88.41     94.29    3.82         13.13 
              C5.0          77.9     85.94     92.8     4.91         14.9 
              3-NN          60.00    66.69     74.29    4.91         14.29 
anneal        ours          91.09    95.11     99.00    2.27          7.91 
              C5.0          90.1     93.59     96.00    1.66          5.9 
              3-NN          87.13    89.70     93.00    1.83          5.87 
sick          ours          95.76    96.55     97.63    0.63          1.87 
              C5.0          98.10    98.78     99.5     0.43          1.40 
              3-NN          92.31    93.00     93.90    0.58          1.59 



From Table 3 we observe that a more accurate classification algorithm does not
markedly change the standard deviation relative to the others, and does not
always reduce it. This indicates that a better algorithm increases its accuracy
roughly evenly on every fold. The same point can be seen from how the minimum
and maximum accuracies change from one algorithm to another.
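
The per-fold statistics reported in Table 3 can be reproduced from a list of
fold accuracies with a few lines of Python. The sketch below is illustrative
only: the function name and the input values are hypothetical, and the use of
the sample standard deviation is an assumption, since the text does not say
which variant was computed.

```python
import statistics


def fold_summary(fold_accuracies):
    """Summarise per-fold accuracies (in %) with the columns of Table 3."""
    lowest, highest = min(fold_accuracies), max(fold_accuracies)
    return {
        "min": lowest,
        "average": statistics.mean(fold_accuracies),
        "max": highest,
        # Sample standard deviation (n - 1 denominator). This is an
        # assumption; the paper does not state which variant it used.
        "std": statistics.stdev(fold_accuracies),
        "max_difference": highest - lowest,
    }


# Hypothetical fold accuracies, for illustration only (not from the paper).
summary = fold_summary([80.0, 90.0, 100.0])
```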

4.3 Effects of Randomisation Process and Neighbourhood Factor 

We also conducted experiments to examine the effect of the ten-fold
randomisation process on the accuracy of our method. We set five different
randomisation seeds, so that the five splits of the original data sets differ.
We then applied our method to obtain five ten-fold average accuracies, one per
split. For the australian data set, the average accuracy over the five splits
is 87.11%, and the standard deviation is 0.71. Similarly, for the anneal data
set, the average accuracy is 95.32% and the deviation is 0.15; for the sick
data set, the average accuracy is 96.50% and the deviation is 0.15. Different
splits can indeed produce different accuracies, though only slightly.
Therefore, when comparing the accuracy of two classification algorithms, one
should apply both algorithms to exactly the same split of the original data,
as done in the current work.
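
One way to guarantee that two algorithms see the same split is to derive the
folds from an explicit seed and pass the same fold list to both evaluations.
The following is a minimal sketch under that assumption; the helper names
`ten_fold_split` and `cross_validate` and the seed value are hypothetical and
do not come from the paper.

```python
import random


def ten_fold_split(n_instances, seed, n_folds=10):
    """Return n_folds disjoint lists of instance indices for a given seed."""
    indices = list(range(n_instances))
    # A private generator keeps the split reproducible and independent of
    # any other use of the global random state.
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(evaluate_fold, folds):
    """Average the accuracy returned by evaluate_fold(train_idx, test_idx)."""
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(evaluate_fold(train_idx, test_idx))
    return sum(scores) / len(scores)


# Hypothetical usage: 690 instances, one fixed seed shared by all algorithms.
folds = ten_fold_split(690, seed=1)
```

Calling `cross_validate` once per algorithm with the same `folds` object then
compares the algorithms on identical training and test partitions.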

Our final round of experiments shows the effect of the neighbourhood factor α
on the accuracy of our classification algorithm. We ran our algorithm with α
set to 0.02, 0.05, 0.08, 0.10, 0.15, and 0.20. The corresponding accuracies on
the australian and german data sets are plotted in Figure 2. Note that the two
curves in this figure have different shapes. The left curve reaches its summit
at α = 0.05 and then goes down as α increases. The right curve, however, starts
with a decline until α = 0.1 and then climbs to its summit at α = 0.2. These
facts strongly indicate that data sets have different properties. They also
confirm the usefulness of our idea of selecting an “optimal” neighbourhood for
a given data set under the guidance of partial training data.



[Figure 2: two line plots of accuracy against Neighborhood Factor Value
(0.0 to 0.2); panel (a) australian, panel (b) german]

Fig. 2. Different neighbourhoods of a test instance can produce different accuracies by
our method. Partial training data can be used to select proper neighbourhoods of test
instances to improve performance of our classifier. (a) Accuracy varies in the range
of [84.20%, 88.50%] on the australian data set. (b) Accuracy varies in the range of
[73.70%, 75.00%] on the german data set.



4.4 Discussions on Speed 

In our experiments, we found that our method was not much slower (on average,
1.1 times slower) than the original DeEPs. We had originally expected our
method to be faster than the original DeEPs, and even faster than k-NN, because

1. k-NN calculates the distance from a test instance to all training points and
sorts the distance values. Our method, by contrast, limits the distance
calculation to the region of a neighbourhood of the test instance. Usually, the
points in such a neighbourhood constitute a very small percentage of the whole
training data.

2. It is very efficient to locate the training data that are covered by the
neighbourhood.

3. The number of outliers should be small.

In reality, because of the computation needed to select an “optimal”
neighbourhood factor and the computation required by DeEPs to make decisions
for outliers, the speed of the proposed classifier did not improve over the
original DeEPs. However, when the neighbourhood was fixed, our method was
generally faster (on average, 1.2 times faster) than the original DeEPs.

5 Conclusion 

We have proposed and developed a new classification algorithm that takes an
instance-based learning strategy. The proposed method combines the advantages
of k-NN and DeEPs to treat outlier instances properly. We have conducted many
experiments from different perspectives to evaluate our system. One important
observation from the experimental results is that the accuracy of our method is
much higher than that of 3-NN and also significantly superior to that of C5.0
(on average over 30 data sets, 2.18% higher). We found that guidance from
partial training data can play a key role in selecting a suitable neighbourhood
for a test instance, and hence can determine when the system should use k-NN or
DeEPs. We have also found that a better algorithm (among ours, k-NN, and C5.0)
did not much change the standard deviation among the ten-fold accuracies on
a data set. This suggests that a better algorithm increases its accuracy evenly
on every fold. As a future research issue, we will further investigate methods
for compactly summarising the supports of a collection of itemsets.



Abstract. Decision tree learning has become a popular and practical
method in data mining because of its high predictive accuracy and ease
of use. However, a set of if-then rules generated from large trees may be
preferred in many cases for at least three reasons: (i) large decision
trees are difficult to understand, as we may not see their hierarchical
structure or may get lost in navigating them; (ii) the tree structure may
cause individual subconcepts to be fragmented (this is sometimes known as
the “replicated subtree” problem); (iii) it is easier to combine newly
discovered rules with existing knowledge in a given domain. To fulfill that
need, the popular decision tree learning system C4.5 applies a rule
post-pruning algorithm to transform a decision tree into a rule set. However,
because it uses a global optimization strategy, C4.5rules runs extremely
slowly on large datasets. On the other hand, rule post-pruning algorithms that
learn a set of rules by the separate-and-conquer strategy, such as CN2, IREP,
or RIPPER, can scale to large datasets, but they suffer from the crucial
problem of overpruning and often do not achieve accuracy as high as that of
C4.5. This paper proposes a scalable algorithm for rule post-pruning of
large decision trees that employs incremental pruning with improvements
in order to overcome the overpruning problem. Experiments show that
the new algorithm can produce rule sets that are as accurate as those
generated by C4.5 and is scalable to large datasets.



1 Introduction 

Data mining algorithms usually have to deal with very large databases. For the
predictive data mining task, in addition to the requirements of high accuracy
and understandability of the discovered knowledge, the mining algorithms must
be scalable, i.e., given a fixed amount of main memory, their runtime should
increase linearly with the number of records in the input database.

Decision tree learning has become a popular and practical method in data
mining because of its significant advantages: the generated decision trees
usually have acceptable predictive accuracy; the hierarchical structure of the
generated trees makes them quite easy to understand if the trees are not large;
and, in particular, the learning algorithms, which employ the
divide-and-conquer (or simultaneous covering) strategy to generate decision
trees, do not require complex computation. However, in certain domains the
comprehensibility and predictive accuracy of decision trees decrease
considerably because of the problem known as subtree replication [Tin (when
subtree replication occurs, identical subtrees can be found at several
different places in the same tree structure).

The solution to the problem of subtree replication in the most well-known
decision tree learning system C4.5 m is to convert a generated decision tree
into a set of rules using a post-pruning strategy |H|. The conversion of trees
into rules is not only an effective way to avoid the subtree replication
problem but also offers other significant advantages: while large trees
generated from large datasets are difficult to understand, discovered knowledge
in the form of rules is much easier to understand. Also, in our practical
experience, domain experts often feel more comfortable analyzing and validating
rules than trees when the trees become large. Moreover, it appears that the
generated rule sets usually have equal or higher predictive accuracy than the
original decision tree. However, the C4.5rules algorithm does not scale to
large databases, because the simulated annealing employed to achieve an optimal
generalization has O(n^3) time complexity, where n is the number of records in
the input database Pj.

The separate-and-conquer (or sequential covering) strategy is an alternative
approach that learns rules directly from databases. The most well-known
separate-and-conquer algorithms include CN2 [2], REP fP, IREP [7], RIPPER
0, and PART |S|. Among them, CN2 and REP also require computation of high
complexity and are therefore not applicable to large databases. IREP and
RIPPER solve the complexity problem by using a scheme called incremental
pruning. As a result, they run very fast and generate small rule sets
with acceptable predictive accuracy. However, incremental pruning may lead to
the problem of overpruning (or hasty generalization), which reduces the
accuracy of the algorithms in many cases. PART p] is an attempt to combine the
divide-and-conquer and separate-and-conquer strategies, and was claimed to be
effective and efficient.

This paper is concerned with scalable algorithms for rule post-pruning of large
decision trees. In particular, it proposes a solution to the problem of high
complexity in C4.5rules by using a scheme similar to incremental pruning. The
essence of the proposed algorithm is to avoid the problem of overpruning
through appropriate improvements to incremental pruning. Experiments show that
the proposed algorithm produces rule sets that are as accurate as those
generated by C4.5 and that it is scalable to very large data sets.

2 Related Works 

A variety of approaches to learning rules have been investigated. One is to begin
by generating a decision tree, then to transform it into a rule set, and finally to
simplify the rules (the divide-and-conquer strategy as used in the system C4.5
CH). Another is to use the separate-and-conquer strategy [Il)j to generate
an initial rule set and then to apply a rule pruning algorithm.

2.1 Rule Post-pruning in C4.5

The rule learner in C4.5 does not employ a separate-and-conquer method to
generate a set of rules; it achieves this by simplifying an unpruned decision
tree. It first grows the unpruned tree using the decision tree inducer included
in the C4.5 software, and then transforms each leaf of that tree into a rule.
This initial rule set will usually be very large because no pruning has been
done. Therefore C4.5 proceeds to prune it using various heuristics.

First, each rule is simplified separately by greedily deleting conditions in order 
to minimize the rule’s estimated error rate. Following that, the rules for each 
class in turn are considered and a “good” subset is sought, guided by a criterion 
based on the minimum description length principle. The next step ranks the 
subsets for the different classes with respect to each other to avoid conflicts, and 
determines a default class. Finally, rules are greedily deleted from the whole rule
set one by one, so long as this decreases the rule set’s error on the training data. 

Unfortunately, the global optimization process is rather lengthy and time- 
consuming. Cohen Pj shows that C4.5 can scale with the cube of the number of 
examples on noisy datasets. 



2.2 Other Related Rule Pruning Algorithms 

The earliest approaches to pruning rule sets are based on global optimization. 
These approaches build a full, unpruned rule set using a separate-and-conquer 
strategy. Then they simplify these rules by deleting conditions from some of 
them, or by discarding entire rules. The simplification procedure is guided by a 
pruning criterion that the algorithm seeks to optimize. The optimized solution
can only be found via exhaustive search. In practice, some heuristic searches are 
applied, but they are still quite time consuming. 

A faster approach to rule pruning, called incremental pruning, was first
introduced in IREP [7] and is also used in RIPPER and RIPPERk P|. The
key idea is to prune a rule immediately after it has been built, before any new
rules are generated in subsequent steps of the separate-and-conquer algorithm.
By integrating pruning into each step of the separate-and-conquer algorithm,
this approach can avoid the high complexity of a global optimization process.



2.3 The Problem of Overpruning 

Although the incremental pruning used in IREP (and its variants) avoids the
costly global optimization process of C4.5rules, it may suffer from the
problem of overpruning, or hasty generalization jS|. Because it uses a
pre-pruning approach, the algorithm does not know about potential new rules
when it considers a rule for pruning; the pruning decisions are based on the
accuracy estimate of the current rule only. In other words, the algorithm
cannot estimate how the pruning decisions on a single rule will affect the
accuracy of the whole final rule set. Therefore, it may happen that the pruning
decisions increase the accuracy of the current rule but in fact decrease the
accuracy of the potential final rule set under the same estimation.

Table 1. Potential rules in a hypothetical dataset

Rule                             Coverage
                                 Growing Set     Pruning Set
                                 yes     no      yes     no
1: A = true → yes                600     60      200     20
2: A = false ∧ B = true → yes    1200    60      400     20
3: A = false ∧ B = false → no    0       30      0       10

Table 1 shows a simple example of overpruning. The example is taken from
p] with a modification to make it easier to calculate. Consider a binary dataset
with two attributes A and B; examples can belong to either class yes or no.
There are three potential rules on the dataset, as shown in the table. Assume
that the algorithm first generates the rule

    A = true → yes. (1)

Now consider whether the rule should be pruned further. Its error rate on
the pruning set is 20/220 = 1/11, and the pruned rule

    → yes (2)

has an error rate of 50/650 = 1/13, which is smaller, so the rule will be
pruned to that null rule. As the null rule covers all the data, the algorithm
stops and is satisfied with a final rule set consisting only of that trivial
rule. But this rule set actually has a greater error rate than 4/65, which is
the error rate of the set of three rules shown in the table. Note that this
happens because the algorithm concentrates on the accuracy of rule 1 when
pruning; it does not make any guesses about the benefits of including further
rules in the classifier.
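
The error rates in this example can be checked directly from the pruning-set
counts of Table 1. The sketch below assumes that the two pruning-set numbers
given for each rule are the covered examples of class yes and of class no, in
that order; the variable and function names are illustrative only.

```python
from fractions import Fraction

# Pruning-set coverage read from Table 1, under the assumption stated above:
# rule -> (predicted class, covered "yes" examples, covered "no" examples).
PRUNING_SET = {
    "rule 1": ("yes", 200, 20),
    "rule 2": ("yes", 400, 20),
    "rule 3": ("no", 0, 10),
}


def error_rate(rules):
    """Error rate of a list of (predicted, yes, no) entries on what they cover."""
    covered = sum(yes + no for _, yes, no in rules)
    wrong = sum(no if predicted == "yes" else yes for predicted, yes, no in rules)
    return Fraction(wrong, covered)


rule1_alone = error_rate([PRUNING_SET["rule 1"]])
# The null rule predicts "yes" for every example in the pruning set.
null_rule = error_rate([("yes", yes, no) for _, yes, no in PRUNING_SET.values()])
all_three = error_rate(list(PRUNING_SET.values()))
```

Comparing `null_rule` with `rule1_alone` reproduces the local pruning decision,
while comparing `null_rule` with `all_three` shows what that decision costs on
the complete rule set.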

3 A Scalable Algorithm for Rule Post-pruning 

In an attempt to solve the high complexity problem of C4.5rules, we have
developed a new algorithm that adopts a scheme similar to the incremental
pruning used in IREP. We have named it CABROrule, as it is integrated in our
decision tree learning system CABRO 0. Like C4.5rules, CABROrule uses a
bottom-up search instead of IREP’s top-down approach: the final rule set is
found by repeatedly removing conditions and rules from an input set of unpruned
rules rather than by adding new rules to an initially empty set. In other
words, CABROrule uses a post-pruning approach, in contrast to the pre-pruning
approach used in IREP. By taking advantage of working with a fully grown rule
set throughout the pruning process, we can improve the incremental pruning
scheme in CABROrule to avoid the problem of hasty generalization.

Table 2. The Main Procedure of CABROrule

procedure CABROrule(UnprunedSet, Data)
    PrunedSet ← ∅
    while (Data ≠ ∅)
        Rule ← SelectRule(UnprunedSet, Data)
        UnprunedSet ← UnprunedSet \ {Rule}
        PrunedRule ← PruneRule(Rule, UnprunedSet, Data)
        PrunedSet ← PrunedSet ∪ {PrunedRule}
        Data ← Data \ Match(PrunedRule, Data)
    return PrunedSet



3.1 Description of the Algorithm 

Like C4.5rules, CABROrule begins with a set of unpruned rules. The rule
set is taken directly from an unpruned decision tree, where each rule
corresponds to a path from the tree root to a leaf node. To prune the rule set,
CABROrule follows a separate-and-conquer strategy: it chooses one rule to prune
at a time, then removes the covered examples and repeats the process on the
remaining examples. Table 2 shows the main procedure of CABROrule; the
procedure for pruning a single rule is given in Table 3.

The efficiency of CABROrule comes from avoiding the search for an
“optimized” subset of rules such as the one in C4.5rules. We now analyze
why C4.5rules requires a global optimization but CABROrule does not. In
C4.5rules, each individual rule is pruned with respect to all the training
data. Deleting conditions from a rule, and thereby increasing its coverage, may
ultimately result in a rule set with many overlaps. The optimized exclusive
subset of rules can only be found via exhaustive search. In practice,
exhaustive search is infeasible and C4.5rules applies heuristic approximations
(the two alternatives in C4.5rules are greedy search and simulated annealing),
but even these approximate algorithms are quite time-consuming. In contrast,
each single rule in CABROrule is pruned with respect to the training data that
remain after removing all examples covered by previously pruned rules. The
exhaustion of the training data serves as the stop condition, and the final
rule set has no overlap. Note that while there is no natural order for the
rules generated by C4.5rules, CABROrule generates ordered rule sets, which are
sometimes known as decision lists |T^ .

To calculate the complexity of CABROrule, we assume that the data set
consists of n examples described by a attributes. To choose a condition to
prune, the procedure PruneRule needs to examine all the conditions of the rule
under consideration, each of which requires n tests on the training examples.
Therefore the complexity of pruning a condition is O(an). Suppose the length of
an unpruned rule is a; then pruning a rule requires O(a^2 n). If we assume that
the size of the final theory is constant |E], the complexity will be linear in
n. Because a decision tree can be built in time O(n log n), the overall cost of
building a rule set from data is O(n log n). This complexity is the same as
that of PART |S| and better than the O(n log^2 n) of IREP or RIPPER, or the
O(n^3) of C4.5rules.

Table 3. Pruning a Single Rule

procedure PruneRule(Rule, UnprunedSet, Data)
    repeat
        Accuracy ← EstimateAccuracy({Rule} ∪ UnprunedSet, Data)
        DeltaAccuracy ← 0
        for each (Condition ∈ Rule) do
            NewRule ← Rule \ Condition
            NewAccuracy ← EstimateAccuracy({NewRule} ∪ UnprunedSet, Data)
            NewDeltaAccuracy ← NewAccuracy - Accuracy
            if (NewDeltaAccuracy > DeltaAccuracy)
                DeltaAccuracy ← NewDeltaAccuracy
                BestCondition ← Condition
        if (DeltaAccuracy > 0)
            Rule ← Rule \ BestCondition
    until (DeltaAccuracy ≤ 0)
    return Rule
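
The two procedures can be sketched in Python as follows. This is a minimal
illustration, not the authors' implementation. It makes several assumptions:
rules are sets of (attribute, value) equality tests, EstimateAccuracy is
replaced by the plain accuracy of an ordered rule list (not the pessimistic
estimate the paper uses), SelectRule picks the rule with the largest coverage,
and the main loop also stops when no unpruned rules remain. All names are
hypothetical.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    conditions: frozenset  # set of (attribute, value) equality tests
    label: str


def matches(rule, example):
    return all(example.get(attr) == value for attr, value in rule.conditions)


def estimate_accuracy(rules, data):
    """Plain accuracy of an ordered rule list (an assumed stand-in estimate).
    The first matching rule decides; unmatched examples count as errors."""
    if not data:
        return 1.0
    correct = 0
    for example, label in data:
        for rule in rules:
            if matches(rule, example):
                correct += rule.label == label
                break
    return correct / len(data)


def prune_rule(rule, unpruned, data):
    """Greedily delete the condition whose removal most improves the estimate
    of the rule taken together with all remaining unpruned rules."""
    while rule.conditions:
        base = estimate_accuracy([rule] + unpruned, data)
        best_gain, best_condition = 0.0, None
        for condition in sorted(rule.conditions):
            candidate = Rule(rule.conditions - {condition}, rule.label)
            gain = estimate_accuracy([candidate] + unpruned, data) - base
            if gain > best_gain:
                best_gain, best_condition = gain, condition
        if best_condition is None:  # no deletion improves the estimate
            break
        rule = Rule(rule.conditions - {best_condition}, rule.label)
    return rule


def cabro_rule(unpruned, data):
    unpruned, data, pruned = list(unpruned), list(data), []
    while data and unpruned:
        # SelectRule (assumed): the rule covering most remaining examples.
        rule = max(unpruned, key=lambda r: sum(matches(r, x) for x, _ in data))
        unpruned.remove(rule)
        pruned_rule = prune_rule(rule, unpruned, data)
        pruned.append(pruned_rule)
        data = [(x, y) for x, y in data if not matches(pruned_rule, x)]
    return pruned


# Pruning set of Table 1; B is set arbitrarily for the A = true examples.
r1 = Rule(frozenset({("A", True)}), "yes")
r2 = Rule(frozenset({("A", False), ("B", True)}), "yes")
r3 = Rule(frozenset({("A", False), ("B", False)}), "no")
data = (
    [({"A": True, "B": True}, "yes")] * 200
    + [({"A": True, "B": True}, "no")] * 20
    + [({"A": False, "B": True}, "yes")] * 400
    + [({"A": False, "B": True}, "no")] * 20
    + [({"A": False, "B": False}, "no")] * 10
)
```

With the other two rules taken into account, `prune_rule(r1, [r2, r3], data)`
leaves rule 1 unchanged, whereas `prune_rule(r1, [], data)`, which ignores
them, collapses it to the null rule.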

Before turning to the next subsection, which addresses the problem of
overpruning, we briefly discuss the procedure SelectRule in CABROrule. Only
some of the input rules get a chance to be considered for pruning and added to
the final rule set. Naturally, we want as many “significant” rules as possible
to have that chance, so a measure is needed to judge the “significance” of a
rule. In general, several existing measures may be suitable for that purpose,
such as relative frequency, the m-estimate of accuracy, or entropy
P). However, when we apply CABROrule to an unpruned decision tree produced by
C4.5, we use the coverage of a rule as the criterion for selecting which rule
to prune first. This is because, when growing a decision tree, C4.5 has already
optimized each path (corresponding to an unpruned rule) by information gain.
It is therefore reasonable to assume that rules with larger coverage are more
important and should be considered first in the pruning process. If CABROrule
is applied to rule sets grown by other algorithms, other criteria for selecting
rules may be better choices.



3.2 Avoiding Overpruning 

CABROrule uses a greedy search algorithm for pruning a single rule. At each
step, the algorithm searches for a condition to prune. The pruning continues
until the accuracy estimate cannot be improved any further. A description of
the algorithm for pruning a single rule is given in Table 3.

To overcome the problem of overpruning in the original incremental pruning
algorithm used in IREP and its variants, CABROrule takes a different
approach to estimating the accuracy when making pruning decisions. Instead of
estimating the accuracy only of the rule under consideration, the procedure
EstimateAccuracy estimates the accuracy of that rule together with all
remaining unpruned rules. As stated in the previous section, overpruning occurs
when a pruning decision on a rule improves the accuracy estimate of that rule
but in fact potentially reduces the accuracy of the final rule set. By taking
the remaining unpruned rules into account when pruning a single rule, we can
make sure that a condition is pruned only if doing so potentially improves the
accuracy of the whole final rule set, not just that of the single rule.

We return to the example of overpruning in Section 2 to illustrate the new
approach. Assume that CABROrule considers pruning rule 1 back to a null
rule. Instead of estimating the accuracy of only rule 1 before and after
pruning it, the algorithm estimates the accuracy of all three rules to make
that pruning decision. From Table 1, we can see that after pruning rule 1 the
estimated error rate is 1/13, which is higher than the 4/65 obtained before the
pruning decision. Therefore the algorithm cancels that pruning decision and
avoids a case of overpruning.

The problem of overpruning, or hasty generalization, is not restricted to a
particular method of accuracy estimation, and our solution to the problem does
not depend on the estimation method. The estimate of reduced error pruning is
used in the example only to make calculations easier. In CABROrule we use a
pessimistic estimate PH similar to the one used in C4.5 to estimate the
accuracy of a rule set. The estimate is obtained by calculating the rule
accuracy over the training data and then calculating the standard deviation of
this estimated accuracy, assuming a binomial distribution. For a given
confidence level (we used 95% in our experiments), the lower-bound estimate is
then taken as the measure of rule performance. The accuracy estimate of a rule
set is the average of the estimates of its member rules, weighted by their
coverage of the data set.
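
A pessimistic estimate of this kind can be sketched as follows. The code is an
illustration under stated assumptions rather than the estimator actually
implemented in CABROrule: it uses the normal approximation to the binomial
distribution, takes 1.96 as the multiplier for the 95% level, and reads
“with respect to their coverage” as a coverage-weighted average. The function
names are hypothetical.

```python
import math

Z_95 = 1.96  # assumed normal quantile for a 95% confidence level


def pessimistic_accuracy(correct, covered, z=Z_95):
    """Lower confidence bound on a rule's accuracy over the examples it covers."""
    if covered == 0:
        return 0.0
    accuracy = correct / covered
    # Standard deviation of the accuracy estimate under a binomial model.
    std = math.sqrt(accuracy * (1.0 - accuracy) / covered)
    return max(0.0, accuracy - z * std)


def rule_set_estimate(rule_stats):
    """Coverage-weighted average of the per-rule lower bounds.

    rule_stats is a list of (correct, covered) pairs, one per rule.
    """
    total = sum(covered for _, covered in rule_stats)
    if total == 0:
        return 0.0
    weighted = sum(pessimistic_accuracy(c, n) * n for c, n in rule_stats)
    return weighted / total
```

For the same observed accuracy, a rule covering fewer examples receives a lower
bound, so the estimate penalises rules supported by little data.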

4 Experimental Results 

In order to evaluate the performance of CABROrule we designed two
experiments. The first evaluates the predictive accuracy of CABROrule
compared with C4.5 and C4.5rules; the second evaluates the run-time of
CABROrule compared with C4.5rules.

For the first experiment we used 31 standard datasets from the UCI collection.
The datasets and their characteristics, together with the experimental results,
are listed in Table 4. We performed 10-fold cross-validation on these datasets
with C4.5, C4.5rules, and CABROrule. The same folds were used for each program.
Each number in the result columns is the average error rate or rule set size
over the ten runs. A “•” symbol in the last column indicates that CABROrule
has an error rate lower than both C4.5 and C4.5rules on that dataset, an
“o” indicates that CABROrule has an error rate lower than C4.5rules, and an
“x” indicates that C4.5rules has an error rate lower than that of CABROrule.

Table 4. Experimental Results

Dataset         #Exam  NumAtt  NomAtt  Class  C4.5           C4.5rules      CABROrule
                                              size   error   size   error   size   error
anneal          898    6       32      5      60.0   4.3     11.3   4.4     12.0   3.1     •
audiology       226    0       69      24     49.0   9.3     21.0   8.8     22.0   8.8
australian      690    6       9       2      34.5   15.2    13.2   16.2    9.6    14.5    •
auto            205    15      10      6      68.7   19.5    22.2   18.4    20.8   20.4    x
balance-scale   625    4       0       3      82.0   22.1    37.0   21.1    28.7   22.2    x
breast          699    9       0       2      27.4   5.1     9.0    4.6     8.2    4.9     x
breast-cancer   286    0       9       2      12.1   25.9    7.8    29.7    3.0    26.2    o
german          100    7       13      2      86.0   29.7    19.7   29.6    14.6   29.1    •
glass           214    9       0       6      44.0   32.7    13.8   30.8    13.6   30.8
glass2          163    9       0       2      23.4   21.9    8.1    20.2    8.0    20.2
heart           303    6       7       2      24.0   12.2    8.8    14.4    7.2    11.5    •
hepatitis       155    6       13      2      18.6   25.2    8.4    20.1    6.7    21.4    x
horse-colic     168    7       15      2      8.2    15.2    5.9    16.0    4.1    14.7    •
hypothyroid     3772   7       22      4      12.2   0.6     6.0    0.6     5.1    0.6
ionosphere      351    34      0       2      25.0   9.4     9.1    8.8     9.2    8.8
iris            150    4       0       3      8.8    5.3     4.1    4.6     4.0    4.6
labor-neg       57     8       8       2      5.7    19.3    4.0    21.0    2.6    19.3    o
lymphography    148    3       15      4      26.9   22.8    9.6    22.8    9.2    22.9    x
mushroom        8124   0       22      2      29.7   0.0     17.0   0.0     16.9   0.0
pima            768    8       0       2      45.2   25.7    10.7   26.3    9.9    25.7    o
primary-tumor   339    0       17      21     77.8   59.3    17.1   60.2    13.9   59.9    o
segment         2310   19      0       7      87.0   2.8     28.2   3.7     27.6   3.7
sick-euthyroid  372    7       22      2      24.6   2.2     12.0   2.4     9.2    2.2     o
sonar           208    60      0       2      27.2   28.9    8.7    30.3    8.7    30.3
soybean-large   683    0       25      19     94.9   7.8     35.8   7.0     3.1    6.7     •
splice          3190   0       61      3      220.2  5.9     74.1   6.5     60.0   6.0     o
vehicle         840    18      0       4      135.8  28.7    26.6   27.1    25.9   26.8    •
vote            435    0       16      2      13.0   6.0     6.4    5.3     5.1    6.2     x
waveform-21     301    21      0       3      542.2  23.2    68.1   22.4    67.6   22.3    •
waveform-40     5002   34      0       3      584.6  24.9    66.1   23.4    68.0   23.0    •
zoo             101    1       15      7      17.4   7.6     7.8    7.6     7.8    7.6

We can observe from Table 4 that CABROrule outperforms C4.5rules on
15 of the 31 datasets, and on 9 of these it outperforms both C4.5rules and
C4.5, whereas C4.5rules has a lower error rate than CABROrule on 6 datasets
(in total, C4.5rules has a lower error rate than C4.5 on 15 datasets and a
higher error rate on 4 datasets). On some datasets the differences between
error rates are too small to be called significant, but this experiment shows
that CABROrule is at least as good as C4.5rules, if not better, in predictive
accuracy.

[Figure 1: line plot of running time; legend: C4.5rules (squares),
CABROrule (circles)]

Fig. 1. Comparison of Running Time



As for the size of the rule sets, CABROrule generated smaller rule sets than
C4.5rules on the majority of datasets, and both substantially reduce the number
of rules compared with C4.5. In many cases that reduction will increase the
understandability of the resulting models, and this experiment reconfirms
the advantage of transforming decision trees into rules.

In order to evaluate the efficiency of CABROrule, the second experiment
was carried out on the census-income dataset. We began with 10000 examples and
repeatedly ran CABROrule and C4.5rules, each time with a larger number of
examples, to learn at what order the run-time increases with the size of the
data. Figure 1 is the graph drawn from the experimental results. This graph
illustrates that the run-time of C4.5rules is higher than O(n^), while
the run-time of CABROrule is about O(n log n), which confirms our calculation
of the algorithm's complexity in the previous section.
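
One simple way to read a growth order off such measurements is to fit a
straight line to log(run-time) against log(number of examples); the slope
approximates the polynomial exponent. The sketch below shows this on synthetic
timings. It is not the procedure used in the paper, and the function name and
all numbers are hypothetical.

```python
import math


def growth_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic timings for illustration only (not measurements from the paper).
sizes = [10_000, 20_000, 40_000, 80_000]
cubic_times = [1e-9 * n ** 3 for n in sizes]
n_log_n_times = [1e-6 * n * math.log(n) for n in sizes]
```

A slope close to 3 indicates cubic growth, while an n log n curve yields a
slope only slightly above 1 over this range of sizes.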

Some significant conclusions can be drawn from these two experiments:

- By applying the incremental pruning approach to the post-pruning problem,
CABROrule can reduce the run-time substantially in comparison to
C4.5rules. This allows us to apply the algorithm to the large datasets that
are very common in data mining.

- There is no loss in terms of predictive accuracy and model size. In fact,
there is some gain in accuracy, and CABROrule usually generates smaller
rule sets than C4.5rules.

- Transforming decision trees into rules may increase both the
understandability and the predictive accuracy of models.

5 Conclusion

This paper has presented a new algorithm for rule post-pruning of decision trees.
It can be considered an alternative to C4.5rules when the input data
become very large. The problem of high complexity in C4.5rules is solved by
adopting an incremental pruning scheme. However, the algorithm does not suffer
from the problem of hasty generalization found in the original incremental
pruning approach. Experiments have shown that the new algorithm generates rule
sets as accurate as those of C4.5 but with far less computation time.



Abstract. The alternating decision tree brings comprehensibility to the
performance-enhancing capabilities of boosting. A single interpretable
tree is induced wherein knowledge is distributed across the nodes and
multiple paths are traversed to form predictions. The complexity of the
algorithm is quadratic in the number of boosting iterations and this
makes it unsuitable for larger knowledge discovery in database tasks. In
this paper we explore various heuristic methods for reducing this
complexity while maintaining the performance characteristics of the original
algorithm. In experiments using standard, artificial and knowledge
discovery datasets we show that a range of heuristic methods with log-linear
complexity are capable of achieving similar performance to the original
method. Of these methods, the random walk heuristic is seen to
outperform all others as the number of boosting iterations increases. The
average case complexity of this method is linear.



1 Introduction 

Highly accurate classifiers can be found using the boosting procedure but as 0 
discovered for standard decision trees the combination of each classifier produced 
at each iteration into a single classifier is multiplicative. Freund and Mason ^ 
introduced a method capable of inducing a single classifier without the exponen- 
tial growth in tree size by changing the representation of the underlying tree. 

Standard decision trees have interior nodes that perform tests on the data 
and leaf nodes labelled with class values. Classification is achieved by following 
the unique path from the root to a leaf for a given unknown instance. The 
alternating decision tree introduces a new node called a predictor node which 
can be either an interior or a leaf node. The tree has a predictor node at its root 
and then alternates between test nodes and further predictor nodes, hence the
name. Classification is achieved by summing the contributions from the predictor 
nodes of all paths that an instance successfully traverses. A positive sum implies 
membership of one class and a negative sum membership of the other. While 
the original algorithm was restricted to two class problems it appears that the 
algorithm can be extended to multiclass problems by using the framework of 0 . 
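
The summation rule described above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions: the names PredictorNode, TestNode, score and classify, the dictionary representation of an instance, and the toy tree itself are hypothetical and are not taken from the original implementation. A sum of exactly zero is arbitrarily mapped to the negative class.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Instance = Dict[str, bool]  # hypothetical instance format: attribute name -> value


@dataclass
class PredictorNode:
    value: float                                   # prediction value stored at this node
    tests: List["TestNode"] = field(default_factory=list)


@dataclass
class TestNode:
    condition: Callable[[Instance], bool]          # weak hypothesis, e.g. "A1 is true"
    if_true: PredictorNode
    if_false: PredictorNode


def score(node: PredictorNode, x: Instance) -> float:
    """Sum the prediction values of all predictor nodes that x reaches."""
    total = node.value
    for test in node.tests:                        # every test below a reached node is followed
        branch = test.if_true if test.condition(x) else test.if_false
        total += score(branch, x)
    return total


def classify(root: PredictorNode, x: Instance) -> int:
    """Return +1 for a positive sum and -1 otherwise."""
    return 1 if score(root, x) > 0 else -1


# Hypothetical toy tree; the prediction values are made up for illustration.
inner = PredictorNode(-1.2, [TestNode(lambda x: x["A2"], PredictorNode(1.0), PredictorNode(-3.4))])
toy_tree = PredictorNode(0.5, [
    TestNode(lambda x: x["A1"], inner, PredictorNode(0.6)),
    TestNode(lambda x: x["A2"], PredictorNode(0.4), PredictorNode(0.2)),
])
```

For the instance with A1 = true and A2 = false, this toy tree visits the predictor nodes holding 0.5, -1.2, -3.4 and 0.2, so the instance is assigned to the negative class.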

Each boosting iteration adds a test (weak hypothesis) and two predictor 
nodes to the tree. The test chosen to extend the tree is the one that minimizes a 
function that measures the “impurity” of the test. The tree can be extended from
any of its existing predictor nodes, which means that for each boosting iteration
the minimization function must be computed for each possible test at each
predictor node. Since the number of predictor nodes grows linearly with the
number of iterations, the algorithm is quadratic in the number of boosting
iterations. This paper explores
heuristic methods for restricting the number of predictor nodes that need to 
be examined for the possible addition of new test nodes. By maintaining the 
performance levels of the original algorithm and reducing its complexity we aim 
to demonstrate that it is possible to produce a practical form of the alternating 
decision tree that can be applied to larger knowledge discovery tasks. 

The paper is organized as follows. In the next section we outline our inter- 
pretations of the original algorithm (some aspects of the induction of alternating 
decision trees were not clearly defined in the original paper). Section 3 looks at 
ways in which the original algorithm can be made more efficient with no loss 
of performance. While these techniques improve the algorithm they do not have 
any effect on the overall complexity, and so Section 4 introduces three heuristic 
search mechanisms for constructing useful paths in the tree without exploring 
all tests at each predictor node. Each of these methods is log-linear in the worst 
case. Section 5 outlines an experiment to determine the efficacy of the heuristic 
methods compared to the original. Accuracy, runtime, and “shallowness” of the 
tree are measured for a range of standard, artificial and knowledge discovery in 
database datasets. Shallowness is measured as the number of leaves in the tree. 
This measure gives a picture of the effect the heuristics have on the overall shape 
of the trees that they prune. Section 6 provides a discussion of the results and 
outlines some avenues for further work. 



2 Inducing Alternating Decision Trees 



Alternating decision trees provide a mechanism for combining the weak hypothe- 
ses generated during boosting into a single representation. Keeping faith with 
the original implementation, we use inequality conditions that compare a single 
feature with a constant as the weak hypotheses generated during each boosting 
iteration. In the original paper some typographical errors and omissions make the
algorithm difficult to implement, so we include below a more complete description
of our implementation.

At each boosting iteration $t$ the algorithm maintains two sets, a set of
preconditions and a set of rules, denoted $\mathcal{P}_t$ and $\mathcal{R}_t$, respectively. A further
set $C$ of weak hypotheses is generated at each boosting iteration.

Initialize. Set the weights $w_{i,0}$ associated with each training instance to 1. Set the
first rule $\mathcal{R}_1$ to have a precondition and condition which are both true. Calculate
the prediction value for this rule as

$$a = \frac{1}{2} \ln \frac{W_+(c)}{W_-(c)}$$

where $W_+(c)$, $W_-(c)$ are the total weights of the positive and negative instances
that satisfy condition $c$ in the training data. The initial value of $c$ is simply True.

Pre-adjustment. Reweight the training instances using the formula

$$w_{i,1} = w_{i,0} \, e^{-a y_i}$$

(for two class problems, the value of $y_i$ is either +1 or -1).

Do for $t = 1, 2, \ldots, T$

1. Generate the set $C$ of weak hypotheses using the weights $w_{i,t}$ associated with
each training instance.

2. For each base precondition $c_1 \in \mathcal{P}_t$ and each condition $c_2 \in C$ calculate

$$Z_t(c_1, c_2) = 2\left(\sqrt{W_+(c_1 \wedge c_2)\, W_-(c_1 \wedge c_2)} + \sqrt{W_+(c_1 \wedge \neg c_2)\, W_-(c_1 \wedge \neg c_2)}\right) + W(\neg c_1)$$

3. Select $c_1, c_2$ which minimize $Z_t(c_1, c_2)$ and set $\mathcal{R}_{t+1}$ to be $\mathcal{R}_t$ with the
addition of the rule $r_t$ whose precondition is $c_1$, condition is $c_2$ and two
prediction values are:

$$a = \frac{1}{2} \ln \frac{W_+(c_1 \wedge c_2) + 1}{W_-(c_1 \wedge c_2) + 1}, \qquad b = \frac{1}{2} \ln \frac{W_+(c_1 \wedge \neg c_2) + 1}{W_-(c_1 \wedge \neg c_2) + 1}$$

4. Set $\mathcal{P}_{t+1}$ to be $\mathcal{P}_t$ with the addition of $c_1 \wedge c_2$ and $c_1 \wedge \neg c_2$.

5. Update the weights of each training example according to the equation

$$w_{i,t+1} = w_{i,t} \, e^{-r_t(x_i) y_i}$$

Output the classification rule that is the sign of the sum of all the base rules in
$\mathcal{R}_{T+1}$:

$$\mathrm{class}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} r_t(x)\right)$$

The best value of T for stopping the boosting process is still an open research 
question. In ^ the value is decided by cross-validation. In this paper we look at 
the effects of heuristics on fixed values for T. 
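
The quantities used in steps 2, 3 and 5 can be written down compactly. The following Python sketch is an illustration only: the function names and the use of boolean masks to represent preconditions and conditions are our own assumptions, not part of the original implementation, and no attempt is made at efficiency.

```python
import math
from typing import List, Sequence, Tuple


def weight_sums(weights: Sequence[float], labels: Sequence[int],
                mask: Sequence[bool]) -> Tuple[float, float]:
    """Return (W+, W-): total weight of positive / negative instances selected by mask."""
    w_pos = sum(w for w, y, m in zip(weights, labels, mask) if m and y > 0)
    w_neg = sum(w for w, y, m in zip(weights, labels, mask) if m and y < 0)
    return w_pos, w_neg


def z_value(weights, labels, pre: Sequence[bool], cond: Sequence[bool]) -> float:
    """Z-value of step 2 for precondition mask `pre` and condition mask `cond`."""
    yes = [p and c for p, c in zip(pre, cond)]          # c1 and c2
    no = [p and not c for p, c in zip(pre, cond)]       # c1 and not c2
    p1, n1 = weight_sums(weights, labels, yes)
    p2, n2 = weight_sums(weights, labels, no)
    outside = sum(w for w, p in zip(weights, pre) if not p)   # W(not c1)
    return 2.0 * (math.sqrt(p1 * n1) + math.sqrt(p2 * n2)) + outside


def prediction_values(weights, labels, pre, cond) -> Tuple[float, float]:
    """Prediction values a and b of step 3, with the unit value added to each weight sum."""
    yes = [p and c for p, c in zip(pre, cond)]
    no = [p and not c for p, c in zip(pre, cond)]
    p1, n1 = weight_sums(weights, labels, yes)
    p2, n2 = weight_sums(weights, labels, no)
    a = 0.5 * math.log((p1 + 1.0) / (n1 + 1.0))
    b = 0.5 * math.log((p2 + 1.0) / (n2 + 1.0))
    return a, b


def reweight(weights, labels, rule_outputs: Sequence[float]) -> List[float]:
    """Weight update of step 5: w <- w * exp(-r(x) * y)."""
    return [w * math.exp(-r * y) for w, y, r in zip(weights, labels, rule_outputs)]
```

With four unit-weight instances, two of each class, a condition that separates the classes perfectly has a Z-value of 0, whereas a condition that splits each class in half has a Z-value of 4.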

Figure 1 depicts a sample alternating decision tree. A hypothetical example
with attribute values A1 = true and A2 = false would be classified according
to the following sum, derived by going down all appropriate paths in that tree
and collecting all prediction values encountered: 0.5 - 1.2 - 3.4 + 0.2 = -3.9
(indicated by horizontal arrows in the figure).
The changes to the original algorithm are fairly minor - the pre-adjustment
phase may have been “implicitly” defined, and the change in the Zt formula to 
represent all instances that do not satisfy the precondition must be typographical 
as is the missing minus sign in the updating phase. The formulas for the newly
generated predictor nodes in stage 3 have a unit value added to avoid
zero-frequency problems [8].




Fig. 1. A sample ADTree. The horizontal arrows indicate all predictor nodes
encountered when classifying an example with A1 = true and A2 = false.



3 Optimizing the Original Algorithm 

The algorithm described in Section 2 is quadratic in the number of boosting
iterations because the calculation of the Z-value for each of the set of C weak
hypotheses is performed at every predictor node in the tree. While this complexity
is unavoidable, there are ways to avoid performing the Z-value calculation
unnecessarily. One is to compute, before evaluating any test at a predictor node
with precondition c, a lower bound on the Z-values obtainable there. We call this
value Zpure(c). It is the best possible Z-value that would result from a pure
split of the training instances under consideration.

$$Z_{pure}(c) = 2\left(\sqrt{W_+(c)} + \sqrt{W_-(c)}\right) + W(\neg c)$$

Straightforwardly using the formula for Z given in the previous section would
have yielded a lower bound of $Z_{pure} = W(\neg c)$ only. But as we adjust all weight
sums in a way reminiscent of the Laplace correction, we are able to derive this
more stringent lower bound. Zpure is a lower bound on the Z-value a good test
could possibly achieve.¹ So if Zpure for some predictor node in the tree is worse
(i.e. larger) than the best test found so far, we do not need to evaluate any test
at this node. Furthermore, one can show that all possible tests at all successor
nodes of this node are also bounded by the same Zpure, as they involve subsets of
the current set of examples only. Therefore we can omit evaluating the complete
subtree rooted at a node cut off by Zpure.

¹ We omit the proof here, but basically one has to show that the following
inequality holds for all a, b, c, d > 0:
$\sqrt{a + b + 1} + \sqrt{c + d + 1} < \sqrt{(a + 1)(c + 1)} + \sqrt{(b + 1)(d + 1)}$.

Optimizing the Induction of Alternating Decision Trees
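
The cutoff can be sketched as a recursive search that skips a whole subtree as soon as the bound of its root cannot beat the best Z-value found so far. The Python below is a hedged illustration, not the authors' code: the Node class, the precomputed test_z lists and the example weights are hypothetical, and we assume that the unit correction is applied inside the square roots, as the inequality in the footnote suggests.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    w_pos: float                      # W+(c): positive weight reaching the node
    w_neg: float                      # W-(c): negative weight reaching the node
    w_outside: float                  # W(not c): weight that does not reach the node
    test_z: List[float] = field(default_factory=list)    # Z-values of candidate tests (precomputed here)
    children: List["Node"] = field(default_factory=list)


def z_pure(n: Node) -> float:
    """Lower bound on any Z-value at n (assumption: +1 correction inside the roots)."""
    return 2.0 * (math.sqrt(n.w_pos + 1.0) + math.sqrt(n.w_neg + 1.0)) + n.w_outside


Best = Tuple[float, Optional[str]]


def search(node: Node, best: Best, visited: List[str]) -> Best:
    """Depth-first search for the smallest Z-value, pruning by z_pure."""
    if z_pure(node) >= best[0]:       # bound cannot beat the best test: skip the whole subtree
        return best
    visited.append(node.name)
    for z in node.test_z:
        if z < best[0]:
            best = (z, node.name)
    for child in node.children:
        best = search(child, best, visited)
    return best


# Hypothetical predictor nodes; the total weight is 18 throughout.
b1 = Node("B1", 0.0, 4.0, 14.0, [21.0])
b = Node("B", 0.0, 8.0, 10.0, [18.5], [b1])
a1 = Node("A1", 9.0, 0.0, 9.0, [17.35])
a = Node("A", 9.0, 1.0, 8.0, [17.4], [a1])
root = Node("root", 9.0, 9.0, 0.0, [17.5], [a, b])
visited: List[str] = []
best = search(root, (math.inf, None), visited)
```

In this example node B has a bound of 18, which is larger than the best Z-value of 17.35 found at A1, so neither B nor its successor B1 is evaluated.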

Duplicated tests have been identified as another source of unnecessary
inefficiency. Especially with larger numbers of boosting iterations (100 and 200 in
our experiments reported below) duplicated tests are common enough to justify
special attention. If a predictor node is the root of two identical tests, both tests 
will induce identical subsets when searching for the next best test to add to the 
tree. Thus we will duplicate work unnecessarily. Fortunately, there is a simple 
remedy for the problem: when adding a new test to a predictor node, we simply 
need to check whether exactly the same test is already present at this node. If 
so, we just merge the old test with the new one by adding the respective pre- 
diction values. This procedure results in exactly the same predictive behaviour 
of the induced alternating decision tree due to its additive nature. But when 
determining the next best test we save time by traversing a smaller tree. 
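
The merging step amounts to a lookup before insertion. The sketch below is a minimal Python illustration under our own assumptions: tests are identified by a string key, and the Predictor class and add_test function are hypothetical names rather than part of the original implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Predictor:
    value: float
    # Tests attached to this predictor node, keyed by a canonical description of the test.
    tests: Dict[str, Tuple["Predictor", "Predictor"]] = field(default_factory=dict)


def add_test(node: Predictor, test_key: str, a: float, b: float) -> bool:
    """Attach a test with prediction values a and b; return True if it was merged."""
    if test_key in node.tests:
        yes, no = node.tests[test_key]
        yes.value += a                # additive model: summing the values preserves predictions
        no.value += b
        return True
    node.tests[test_key] = (Predictor(a), Predictor(b))
    return False
```

Adding the test "x <= 3" twice to the same predictor node, first with values (0.4, -0.2) and then with (0.1, 0.3), leaves a single test with values 0.5 and 0.1.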

The effects of both the Zpure cutoff and the merging of tests have been studied
experimentally and are discussed in Section 5. Summary results are depicted in
Figure 3.



4 Heuristic Search Variants 

Even though both methods described in the previous section do improve the 
efficiency of the algorithm, they do not alter its quadratic nature in general. 
Determining the next best test to add still involves looking at (almost all) current 
predictor nodes and for each of those evaluating all possible tests. With every 
new test, i.e. at every boosting iteration where we are not able to merge tests, we 
add two more predictor nodes to the tree. A way of reducing the total complexity 
is to limit the search to just a subset of all predictor nodes, hopefully including 
the node that would have yielded the next best test using the exhaustive search 
of the original induction algorithm. 

Figure 2 demonstrates the heuristic we have chosen to investigate here. Instead
of recursively exploring the complete tree, we limit search at each boosting
iteration to just one path down the tree. Obviously this must reduce the
complexity, as now we will only be exploring a logarithmically-sized subset of all
predictor nodes. Additionally, this procedure seems to yield shallower trees
on average, thus improving efficiency further. On the other hand, such
heuristically induced trees are different from the original trees, so we will have
to explore whether we are trading off gains in efficiency for worse predictive
error or less comprehensible trees, or even both.

The next section will empirically explore these questions, but first we need
to define the heuristics used for determining the particular paths to be explored.
Basically, we would like to have a good chance of including the node with the
best test. In order to achieve this we invented the first two of the following three
heuristics. The third heuristic was initially added simply to function as a baseline
for comparisons, but turned out to perform rather well in practice, too.

Fig. 2. The exhaustive method (a) has to evaluate all possible additional tests for
all predictor nodes. Going down just one path (b) considerably reduces the number of
tests to evaluate.

1. Heaviest path: looking at the formula for Z we see that larger sets of more
important examples, i.e. “heavier” sets, can lead to a larger reduction, provided
we find a test that separates both classes reasonably well. Therefore
this heuristic always follows the path of the heaviest² subset of examples.

2. Best possible Zpure: reflecting on the previous heuristic we see that it
sometimes might lead us astray. The heaviest subset could consist of a large
number of examples of just one class, so every conceivable split would still
not result in a particularly good Z value. Consequently, this heuristic chooses
to follow the path of the subset with the smallest possible value for
Zpure. Clearly, it too cannot provide any guarantee that we will be
able to find a split performing as well as theoretically possible.

3. Random walk: this heuristic is a baseline for comparison and it involves
the least computational effort of all. Interestingly, as its choices are purely
random, it will explore all paths with equal probability. So every single path
has a fair chance of being chosen for evaluation at some boosting iteration
or other.

No matter which method we choose for selecting a single path, we will always
be exploring considerably fewer predictor nodes, which should at least result
in considerable savings in terms of time needed for induction. We will try to
quantify these savings empirically in the next section.

² Heaviest is literally correct here as we sum the weights of the subset of examples at
a node to determine how heavy it is.
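
The three heuristics differ only in how the next predictor node on the path is chosen. The Python sketch below illustrates this under our own assumptions: the PNode class, its precomputed weight and z_pure fields, and the rule of choosing among all successor predictor nodes of the current node are hypothetical simplifications, not a description of the original implementation.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PNode:
    name: str
    weight: float                     # total weight of the examples reaching this node
    z_pure: float                     # lower bound on Z for tests at this node
    children: List["PNode"] = field(default_factory=list)


Chooser = Callable[[List[PNode]], PNode]


def heaviest(children: List[PNode]) -> PNode:
    return max(children, key=lambda n: n.weight)


def best_z_pure(children: List[PNode]) -> PNode:
    return min(children, key=lambda n: n.z_pure)


def random_walk(rng: random.Random) -> Chooser:
    return lambda children: rng.choice(children)


def select_path(root: PNode, choose: Chooser) -> List[str]:
    """Follow a single path from the root down to a leaf predictor node."""
    path, node = [root.name], root
    while node.children:
        node = choose(node.children)
        path.append(node.name)
    return path


# Hypothetical tree on which the first two heuristics disagree.
left = PNode("L", 12.0, 20.0, [PNode("LL", 7.0, 22.0), PNode("LR", 5.0, 21.0)])
tree = PNode("root", 18.0, 15.0, [left, PNode("R", 6.0, 18.0)])
```

On this tree the heaviest path is root, L, LL, the best Zpure path is root, R, and the random walk returns one of the three root-to-leaf paths. Only the predictor nodes on the selected path would then be searched for the next test.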
5 Experiments and Results

This section compares the performance of the original optimized algorithm of 
Section 3 with the heuristic variants described in Section 4. The methods were 
compared for accuracy, runtime and the number of leaves they produced in the 
resulting tree. 

The datasets and their properties are listed in Table 1. The first sixteen are 
taken from the UCI repository [Q. These datasets were evaluated using a single 
ten-fold cross validation. The remaining datasets are labelled as “Knowledge 
Discovery” datasets and come from two sources. The sets called adult and coil 
are from the KDD section of the UCI repository while the artificial datasets 
art1, art2, and art3 were generated using a technique described in [8]. Due to
their size, these datasets were evaluated using a single train and test split. The 
table lists the respective train and test set sizes for these cases. One aim of the 
experiments is to show that the effects of the heuristics scale well with the data. 

Figure 3a shows the effect on average relative runtimes of optimizing the
original algorithm by merging common branches and employing the Zpure cutoff
across all the UCI datasets in Table 1. The figures for the four variations are



Table 1. Datasets used for the experiments.

Dataset          Instances      Missing       Numeric       Nominal
                                values (%)    attributes    attributes

UCI Datasets
breast-cancer    699            0.2           9             0
Cleveland        303            0.2           6             7
credit           690            0.6           6             9
diabetes         768            0.0           8             0
hepatitis        155            5.4           6             13
hypothyroid      3772           5.4           7             22
ionosphere       351            0.0           34            0
kr-vs-kp         3196           0.0           0             36
labor            57             33.6          8             8
mushroom         8124           1.3           0             22
promoters        106            0.0           0             57
sick-euthyroid   3163           6.5           7             18
sonar            208            0.0           60            0
splice           3190           0.0           0             61
vote             435            5.3           0             16
vote1            435            5.5           0             15

KDD Datasets
coil             5822/4000      0.0           85            0
adult            32561/16281    0.2           6             8
art1             50000/50000    0.0           0             50
art2             50000/50000    0.0           25            25
art3             50000/50000    0.0           50            0
shown relative to the original algorithm runtime at 10 iterations. The variations
are: the original algorithm with no optimization, the original merging common
branches only, the original employing the Zpure cutoff only, and finally the
original using both optimizations. For numbers of boosting iterations up to 50
there is little to be gained by these methods, but beyond 50 significant gains can
be made, particularly by merging. The biggest reduction occurs at 200 iterations
when both optimizations are used, giving an average runtime saving of around
one third.

Figure 3b charts the average relative runtimes across the same datasets, comparing
the original optimized algorithm with the three heuristic methods described
in Section 4. The figures for the heuristic methods are relative to “random
search” at 10 iterations. The relative differences in performance are negligible
only at 10 iterations. Beyond this value the heuristic methods are clearly
superior. The random walk method in particular is twice as fast as the other two
heuristic methods at all iterations, and an order of magnitude faster than the
original algorithm at 100 and 200 iterations. In general, the heaviest path and
Zpure heuristics have rather similar runtime behaviour; sometimes they even
induced identical trees. The random walk method follows a runtime curve which
we suspect is linear. A possible explanation for this surprising average case
behaviour is given in the next section.



[Figure 3: two panels, each titled “Average relative runtimes”. Panel (a) compares
the search variants Original, Merging, CutZpure and Merge&Cut; panel (b) compares
Exhaustive, MaxWeight, Zpure and Random.]

Fig. 3. Average relative runtimes for (a) variations on the original algorithm and (b)
the various heuristic search methods.



The runtime performance of the heuristic variants is only relevant if there
is no appreciable degradation in predictive accuracy for these methods when
compared to the original. Figure 4a shows their performance relative to the
original. All error figures are shown relative to the original at 10 iterations. It
can be seen that for 10 iterations the heuristic methods fail to match the
performance of the original, particularly the random walk heuristic, which is up
to 20% worse. At 50 iterations the gap has closed considerably, and beyond 50
the random walk method actually outperforms the original algorithm. The other
two heuristics again have the same relative performance, which is consistently
slightly worse than the original. One explanation for the superior performance
of the random walk method is that it may avoid overfitting due to “natural”
pruning, but this is only a hypothesis.

The number of leaves (predictor nodes) produced by the various methods
gives some indication of the shape of the trees being produced by the heuristic
methods. Figure 4b shows these leaf figures relative to the original algorithm
at the respective number of iterations. As can be seen, the heuristic methods
have significant numbers of additional leaf predictor nodes relative to the original
method. This is a clear indication that these trees are more shallow, i.e. that
they contain more but shorter paths on average. To understand this result we
need to visualize the possible shapes of an alternating decision tree. For a fixed
total number N of predictor nodes the minimum number of leaves, (N + 1)/2, is
achieved by a perfectly binary tree. The maximum number, N - 1, is achieved by a
flat list of tests; such a totally flat alternating decision tree is actually
equivalent to an ensemble of boosted decision stumps. Thus a higher number of
leaves indicates a more decision stump-like tree shape.
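
The two extreme shapes can be checked with a small counting exercise. The Python sketch below is purely illustrative and uses a representation of our own choosing: a predictor node is the list of its tests, and each test is a pair of successor predictor nodes.

```python
from typing import Tuple


def count(node: list) -> Tuple[int, int]:
    """Return (number of predictor nodes, number of leaf predictor nodes)."""
    if not node:                      # a predictor node without tests is a leaf
        return 1, 1
    nodes, leaves = 1, 0
    for yes, no in node:
        for child in (yes, no):
            n, l = count(child)
            nodes += n
            leaves += l
    return nodes, leaves


def flat(t: int) -> list:
    """All t tests hang off the root: equivalent to t boosted decision stumps."""
    return [([], []) for _ in range(t)]


def chain(t: int) -> list:
    """Every non-leaf predictor node carries exactly one test."""
    node: list = []
    for _ in range(t):
        node = [(node, [])]
    return node
```

With T = 5 tests both shapes have N = 11 predictor nodes; the flat tree has N - 1 = 10 leaves, whereas the tree with one test per interior predictor node has (N + 1)/2 = 6.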



[Figure 4: panel (a) “Average relative predictive error” and panel (b) “Average
relative number of leaves”, each plotted against the number of boosting iterations.]

Fig. 4. Average relative accuracies and number of leaves for the various heuristic search
methods.



6 Conclusions and Further Work 

This paper has presented an improved version of the original alternating decision
tree algorithm of Freund and Mason. This improved method is still quadratic in
the number of boosting iterations and as such is not particularly useful for
knowledge discovery. The use of heuristics to speed up the algorithm was investigated
and we have shown that it is possible to achieve results similar to, and occasionally
better than, the original, particularly for large numbers of boosting iterations. In
terms of runtime, all heuristic methods were superior. Informal analysis would
indicate that all heuristic methods have O(n log n) worst case complexity, but
that they enjoy O(n) average case complexity thanks to the shallowness of the
trees they are inducing.

This perceived shallowness also has some impact on comprehensibility. The 
heuristic methods need to induce larger trees to be competitive with respect 
to predictive accuracy. Obviously, larger trees are harder to read. But due to 
the additive nature of alternating decision trees, they can be understood as 
the sum of all paths. Consequently, we can look at single paths in isolation to 
understand their respective contribution to the final prediction. Luckily, in a 
shallow tree most of these paths are rather short, thus they will be relatively 
easy to comprehend. 

In future work we will investigate other approaches to speeding up the original
algorithm, which will be based on adaptive caching of some of the statistics
that are currently recomputed over and over again. Furthermore, the alternating
decision tree algorithm can be extended in a variety of ways. The first and
most important is to produce a version of the algorithm capable of handling
multiple classes. It would also make sense to apply the trees to regression and
cost-sensitive classification problems. Unlike standard decision trees, where
combining is multiplicative, combining alternating trees is linear, which opens up
the possibility of bagging them in the hope of good performance. Especially
in the presence of noise, which is problematic for boosting algorithms in general,
such a bagging approach might alleviate boosting’s tendency to overfit the
noise.

The improved ADTree induction algorithm as well as the artificial dataset
generator described above will both be included in the next version of
the WEKA machine learning workbench, which is available under the GNU General
Public License.
Abstract. CAEP, namely Classification by Aggregating Emerging Patterns,
builds classifiers from Emerging Patterns (EPs). EPs mined from
the training data of a class are distinguishing features of the class. To
classify a test instance t, the scores obtained by aggregating the EPs in t measure
the weight we put on each class; direct comparison of scores decides t’s
class. However, the skewed distribution of EPs among classes and the intricate
relationships between EPs sometimes make the decision by directly
comparing scores unreliable. In this paper, we propose to build a Score
Behaviour Knowledge Space (SBKS) to record the behaviour of training
data on scores; the classification decision is drawn from SBKS from a statistical
point of view. Extensive experiments on real-world datasets show
that SBKS frequently improves CAEP classifiers, especially on datasets
where they have relatively poor performance. The improved CAEP classifiers
outperform the state-of-the-art decision tree classifier C5.0.



1 Introduction 

The recently proposed classification model CAEP,¹ namely Classification by
Aggregating Emerging Patterns, builds classifiers from Emerging Patterns
(EPs). EPs mined from the training data of a class are distinguishing features
of the class. Functions have been proposed to measure the aggregate
contribution of EPs that appear in a test instance t, resulting in scores;² t
is labelled with the class with the highest score. Although the order of scores is
generally a good indication of t's label, the unbalanced distribution of EPs
among classes and the intricate relationships between EPs sometimes make it
unreliable. In this paper, rather than relying on the absolute order of scores,
we propose to consider the behaviour of training data on scores to make the
classification decision. Specifically, we build a Score Behaviour Knowledge Space
(SBKS) for training data and derive the final classification decision from a
statistical point of view. Experiments on 28 datasets from the UCI machine
learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html)

^ CAEP also refers to a specific classifier(§|2}, which will be clear from context. 

^ Scores refer to both the aggregate score of CAEP and encoding cost of iCAEP(§|3). 
show that SBKS frequently improves the accuracy of CAEP classifiers, especially
on datasets where they are relatively weak. 

Behaviour knowledge space was first proposed in pattern recognition for
combining multiple experts (CME): when K classifiers give their individual
decisions e(1), ..., e(K) about the identity of an input x, what is the combination
function E(e(1), ..., e(K)) which can produce the best final decision? Behaviour
knowledge space consists of units where the decisions of the classifiers are recorded
and from which the CME decision is drawn. We incorporate the idea of recording the
behaviour of training data on scores and develop an elaborate scheme for constructing
SBKS and efficient algorithms for searching SBKS for a reliable classification decision.

CAEP maps training instances into points in a Euclidean space of scores.
Models like decision trees or nearest neighbour can be used to classify such a space.
A decision tree divides the space into regions by considering dimensions
sequentially. In contrast, SBKS concurrently divides the score space into fine-grained
“hypercubes”. K-nearest neighbour classifies a test point by voting on its k
nearest points, where one dimension may be dominant in the distance measure.
With SBKS, each dimension is of equal weight and our classification algorithm
searches subspaces of SBKS for a decision.



2 Classification by Aggregating Emerging Patterns 



Items are (continuous_attr, interval) or (discrete_attr, value) pairs.
An itemset is a set of items; an instance defined by n attributes is an itemset of
n items. The support of itemset x in dataset D, $supp_D(x)$, is
$|\{t \in D \mid x \subseteq t\}| / |D|$. Given D' and D'', the growth rate of an itemset x
from D' to D'' is $GR(x) = supp_{D''}(x) / supp_{D'}(x)$ (define $\frac{0}{0} = 0$ and
$\frac{>0}{0} = \infty$); EPs from D' to D'', or simply EPs of D'', are itemsets
with growth rate greater than a threshold minrate (minrate > 1).

Example 1. e is an EP from the Malignant (M) to the Benign (B) class of Breast
cancer (Wisc): e = {(Bare-Nuclei, 1), (Bland-Chromatin, 3), (Normal-Nucleoli, 1),
(Mitoses, 1)}. $supp_M(e)$ = 0.41%, $supp_B(e)$ = 20.31% and GR(e) = 49.54. e
has high predictive power: with odds of 98%, instances containing e are benign.
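
The definitions of support and growth rate translate directly into code. The following Python sketch is an illustration under our own assumptions: instances and itemsets are represented as frozensets of (attribute, value) pairs, the function names are hypothetical, and datasets are assumed to be non-empty.

```python
import math
from typing import FrozenSet, Sequence, Tuple

Item = Tuple[str, object]
Itemset = FrozenSet[Item]


def support(x: Itemset, dataset: Sequence[Itemset]) -> float:
    """Fraction of instances in the dataset that contain every item of x."""
    return sum(1 for t in dataset if x <= t) / len(dataset)


def growth_rate(x: Itemset, d_from: Sequence[Itemset], d_to: Sequence[Itemset]) -> float:
    """Growth rate of x from d_from to d_to, with 0/0 = 0 and (>0)/0 = infinity."""
    s_from, s_to = support(x, d_from), support(x, d_to)
    if s_from == 0:
        return 0.0 if s_to == 0 else math.inf
    return s_to / s_from


def odds(gr: float) -> float:
    """GR / (GR + 1), taken to be 1 for an infinite growth rate."""
    return 1.0 if math.isinf(gr) else gr / (gr + 1.0)
```

Applied to the figures of Example 1, the ratio 20.31% / 0.41% gives a growth rate of about 49.54, and 49.54 / 50.54 gives the quoted odds of about 98%.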



For a training dataset D of m classes, $D = D_1 \cup D_2 \cup \ldots \cup D_m$, where $D_i$
consists of the training instances for class $C_i$. The EPs of all classes,
$E = E_1 \cup E_2 \cup \ldots \cup E_m$, form the model for D; $E_i$ is the EP set for $C_i$,
consisting of EPs from $D - D_i$ to $D_i$.
In Example 1, if an instance t only contains e, we tend to assign t “Benign”.
However, if t contains EPs of both the Benign and Malignant classes, as will be
discussed next, we classify t by aggregating the EPs appearing in t.

CAEP aggregates EPs from a probabilistic perspective. The combined
power of EPs of $C_i$ that appear in t is t’s aggregate score (or score) for $C_i$,
where the first term computes the odds that t belongs to $C_i$:

$$score(t, C_i) = \sum_{e \subseteq t,\; e \in E_i} \frac{GR(e)}{GR(e) + 1} \cdot supp_{D_i}(e)$$
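
The aggregate score can be sketched as a single loop over the EPs of a class. This Python fragment is a minimal illustration, not the CAEP implementation: the EP record, the use of plain strings as items, and the treatment of an infinite growth rate as odds of 1 are our own assumptions.

```python
import math
from typing import FrozenSet, Iterable, NamedTuple


class EP(NamedTuple):
    items: FrozenSet[str]
    growth_rate: float
    support: float                    # support of the EP in its own class D_i


def caep_score(t: FrozenSet[str], eps: Iterable[EP]) -> float:
    """Sum GR/(GR+1) * support over the EPs of one class that are contained in t."""
    total = 0.0
    for ep in eps:
        if ep.items <= t:             # the EP appears in the test instance
            gr = ep.growth_rate
            strength = 1.0 if math.isinf(gr) else gr / (gr + 1.0)
            total += strength * ep.support
    return total
```

For an instance t = {a, b, c} and the hypothetical EPs ({a}, GR 3, support 0.5), ({b, d}, GR 9, support 0.4) and ({c}, infinite GR, support 0.2), only the first and third EPs appear in t, giving a score of 0.75 * 0.5 + 1 * 0.2 = 0.575.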
Naturally, one tends to assign t the class $C_i$ for which $score(t, C_i)$ is the
maximum. It turns out, however, that classes present different features on EPs and
direct comparison of scores often leads to inaccurate decisions. Normalization is
proposed to overcome the problem. Instead of letting the class with the highest
raw score win, CAEP lets the class with the highest normalized score win:
$normscore(t, C_i) = score(t, C_i) / basescore(C_i)$; $basescore(C_i)$ is taken at a
percentile (50%-85%) when the scores are in decreasing order.

iCAEP aggregates EPs in an information-based approach. According to
the minimum message length theory, t should be labelled $C_i$ where the total cost
of encoding $C_i$ and of encoding t under $C_i$ is the minimum. E is the model for D,
and each class has the same encoding length under E. However, the encoding
length of t under different classes is different: EPs of $E_j$, with high support in
$D_j$ and low support in $D_i$ ($i \neq j$), have the smallest encoding cost in $C_j$. $E^*$, the
representative EP set to encode t, consists of long EPs of all classes:

$$E^* = \bigcup_{j=1}^{m} E_j^*, \qquad E_j^* \subseteq E_j,$$

such that the EPs $e_1, \ldots, e_p$ in $E^*$ form a partition of t.

Under the encoding scheme that a message of probability P incurs an encoding
cost of $-\log_2(P)$ bits, the encoding cost of t under $C_i$, $L(t \| C_i)$, is

$$L(t \| C_i) = -\sum_{k=1}^{p} \log_2 P(e_k \mid C_i), \qquad e_k \in E^*$$

t is assigned the class $C_i$ where $L(t \| C_i)$ is the minimum. Experiments show that
normalization cannot notably improve classification accuracy.
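
The encoding cost and the resulting decision rule can be sketched as follows. This Python fragment is illustrative only: how the probabilities $P(e_k \mid C_i)$ are estimated is not shown, so they are simply passed in as numbers, they are assumed to be strictly positive, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def encoding_cost(probabilities: Sequence[float]) -> float:
    """Cost in bits of encoding t under a class: minus the sum of log2 P(e_k | C_i)."""
    return -sum(math.log2(p) for p in probabilities)


def assign_class(costs: Dict[str, float]) -> str:
    """Label t with the class under which its encoding cost is smallest."""
    return min(costs, key=costs.get)
```

If the EPs covering t have probabilities 0.5 and 0.25 under C1 but 0.125 and 0.5 under C2, the costs are 3 and 4 bits respectively, so t is assigned C1.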

ConsEPMiner is employed to mine EPs. Given a support threshold
minsupp, a growth rate threshold minrate and a growth-rate improvement threshold
minrateimp (the growth-rate improvement of an EP e, rateimp(e), is
$\min(\forall e' \subset e,\ GR(e) - GR(e'))$), ConsEPMiner uses all constraints to
effectively control the blow-up of candidate EPs and successfully mines EPs from
large high-dimensional datasets.

3 Building Score Behaviour Knowledge Space to Make 
Classification Decision 

To further address the problem of unreliable decisions made by comparing scores, we
propose to build a score behaviour knowledge space to make the classification decision.



3.1 Score Behaviour Knowledge Space 

Given a training dataset of m class labels, a Score Behaviour Knowledge Space
(SBKS) is an m-dimensional space where each dimension corresponds to the
score for a class. As will be discussed in Section 3.2, for each dimension i,
$1 \le i \le m$, the score range $[0, \infty)$ is first divided into $K_i$ intervals numbered as
1, 2, 3, ..., $K_i$. A unit is an m-dimensional hypercube defined by m intervals,
denoted as $u = (u_1, u_2, \ldots, u_m)$. SBKS then consists of $K_1 * K_2 * \ldots * K_m$ units.
A subspace of SBKS, $S = [[u_{11}, u_{12}], \ldots, [u_{m1}, u_{m2}]]$, consists of the units falling
into the range defined by S. For example, in a 2-dimensional SBKS, the subspace
S = [[1, 2], [5, 6]] consists of 4 units, namely, S = {(1, 5), (1, 6), (2, 5), (2, 6)}.

For a training instance t where $score(t) = (x_1, \ldots, x_m)$, t is associated with
the unique unit $u = (u_1, \ldots, u_m)$, where $x_i$ falls into the interval $u_i = [s_{i1}, s_{i2})$.
In SBKS, each unit records the number of incoming (training) instances for each
class. Given a unit u, let $n_u(i)$ denote the number of instances of class i in u
and $T_u$ denote the total number of instances in u, $T_u = \sum_{i=1}^{m} n_u(i)$. For subspace
S, $n_S(i) = \sum_{u \in S} n_u(i)$ and $T_S = \sum_{u \in S} T_u$. The semantics of SBKS is clear: for a
subspace S, the probability that an input falling into S belongs to class i is
$n_S(i) / T_S$ ($T_S > 0$). The class with the largest probability is called the
representative class of S, denoted as E(S). When no training instances fall into S or
when classes have the same probability, the representative class of S is nil.

$$E(S) = \begin{cases} \ell & \text{if } T_S > 0 \text{ and } n_S(\ell) / T_S \text{ is the maximum} \\ nil & \text{otherwise} \end{cases}$$



Example 2. In CAEP, for 100 randomly selected training instances from Horse
colic (2 classes), dividing each score dimension into 5 intervals, we get the SBKS of
Table 1, where for a unit u we record $n_u(C_1)/n_u(C_2)$. Consider unit (4, 2): a total
of 9 instances fall into this unit; 8 are $C_1$ instances and 1 is a $C_2$ instance. For an
instance t where $score(t, C_1) \in [17.6, 26.3)$ and $score(t, C_2) \in [0.02, 42.5)$, with
probability 8/9 = 88.89%, t belongs to $C_1$; with probability 1/9 = 11.11%,
t belongs to $C_2$. E((4, 2)) = $C_1$.



Table 1. The SBKS for 100 Horse colic Instances

C1 \ C2            1: [0, 0.02)   2: [0.02, 42.5)   3: [42.5, 85)   4: [85, 127.4)   5: [127.4, ∞)
1: [0, 0.04)       2/0            0/0               0/2             0/1              0/12
2: [0.04, 8.8)     4/0            7/6               3/3             1/3              0/7
3: [8.8, 17.6)     1/0            12/1              4/0             0/0              0/0
4: [17.6, 26.3)    1/0            8/1               1/0             0/0              0/0
5: [26.3, ∞)       2/0            18/0              0/0             0/0              0/0



3.2 Building SBKS to Make Classification Decision 

To build SBKS, we need to first divide $[0, \infty)$, the score range for training
and test instances, into intervals. Sorting the scores of training instances into
increasing order, we get the range $[S_{i1}, S_{i2})$ into which 80% of the training instances
fall. By equally dividing $[S_{i1}, S_{i2})$ into K intervals, K + 2 intervals are formed:
$[0, S_{i1})$, $[S_{i1}, S_{i1} + a)$, ..., $[S_{i1} + (K-1) * a, S_{i2})$, $[S_{i2}, \infty)$, where $a = (S_{i2} - S_{i1})/K$.
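
The division of one score dimension into K + 2 intervals can be sketched as follows. This Python fragment is an illustration under our own assumptions: the two end points of the central range are passed in as s1 and s2 (how they are derived from the sorted training scores is not shown), and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import List


def boundaries(s1: float, s2: float, k: int) -> List[float]:
    """Lower bounds of intervals 2 .. K+2: s1, s1 + a, ..., s1 + (K-1)a, s2."""
    a = (s2 - s1) / k
    return [s1 + j * a for j in range(k)] + [s2]


def interval_index(x: float, bounds: List[float]) -> int:
    """1-based number of the interval containing score x (1 is [0, s1), K+2 is [s2, inf))."""
    return bisect_right(bounds, x) + 1
```

With s1 = 0.04, s2 = 26.3 and K = 3 the boundaries are approximately 0.04, 8.8, 17.5 and 26.3, which matches the five row intervals of Table 1 up to rounding; a score of 16.8 then falls into interval 3.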
Given an $m$-dimensional SBKS, to classify an input $t$ where $score(t) =
(x_1, ..., x_m)$, suppose that $t$ falls into unit $u$. If $E(u) \neq$ nil, $t$ is assigned $E(u)$;
otherwise, a larger subspace centered at $u$ is searched. As shown in Fig. 1, at
line 6, function S.increment() forms a larger subspace centered at $S$: suppose
$S = [[u_{11}, u_{12}], ..., [u_{m1}, u_{m2}]]$; after S.increment(),
$S = [[u_{11} - 1, u_{12} + 1], ..., [u_{m1} - 1, u_{m2} + 1]]$, which takes in all the units
surrounding $S$. From lines 3 to 6, $S$ keeps growing until $t$'s label is decided or the
whole SBKS is searched.



classify(SBKS B, test instance t)
;; decide t's label from B; score(t) = (x_1, x_2, ..., x_m), x_i is the score for C_i
1) l ← nil;
;; for 1 ≤ i ≤ m, u_i = [s_i1, s_i2) and s_i1 ≤ x_i < s_i2
2) S ← [[u_1, u_1], ..., [u_m, u_m]];
3) while l = nil do
4)     l ← E(S);
5)     if l ≠ nil then return l;
6)     S.increment();



Fig. 1. Classify(): Classify a test instance by SBKS
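
The search in Fig. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `representative`, `classify`, `ROWS`, and `TABLE_1` are hypothetical, the SBKS is held in a plain dictionary rather than the hash tree mentioned in the text, and the loop stops with nil once the grown subspace covers the whole SBKS. The counts are those of Table 1.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Unit = Tuple[int, ...]
Box = List[Tuple[int, int]]  # one inclusive (low, high) index pair per dimension


def representative(sbks: Dict[Unit, Dict[str, int]], box: Box) -> Optional[str]:
    """E(S): the class with the strictly largest count in S, or None (nil)."""
    totals: Dict[str, int] = {}
    for unit in product(*(range(lo, hi + 1) for lo, hi in box)):
        for label, count in sbks.get(unit, {}).items():
            totals[label] = totals.get(label, 0) + count
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # no training instance fell into S
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # the leading classes tie
    return ranked[0][0]


def classify(sbks: Dict[Unit, Dict[str, int]], unit: Unit, k: int) -> Optional[str]:
    """Grow a subspace around `unit` until E(S) is decided (k intervals per dimension)."""
    box: Box = [(u, u) for u in unit]
    while True:
        label = representative(sbks, box)
        if label is not None:
            return label
        if all(lo <= 1 and hi >= k for lo, hi in box):
            return None  # the whole SBKS has been searched
        box = [(lo - 1, hi + 1) for lo, hi in box]  # S.increment()


# Counts from Table 1, written as n(C1)/n(C2); rows index C1, columns index C2.
ROWS = [
    "2/0 0/0 0/2 0/1 0/12",
    "4/0 7/6 3/3 1/3 0/7",
    "1/0 12/1 4/0 0/0 0/0",
    "1/0 8/1 1/0 0/0 0/0",
    "2/0 18/0 0/0 0/0 0/0",
]
TABLE_1 = {
    (i, j): dict(zip(("C1", "C2"), map(int, cell.split("/"))))
    for i, row in enumerate(ROWS, start=1)
    for j, cell in enumerate(row.split(), start=1)
}
```

Calling `classify(TABLE_1, (3, 4), 5)` first finds the empty unit (3, 4) and then decides on the enlarged subspace [[2, 4], [3, 5]], where the counts are 9 for C1 and 13 for C2.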



Suppose that each dimension consists of $K$ intervals. To decide the label of
an input $t$, Classify() searches from a minimum of 1 unit to a maximum of $K^m$
units. In implementing the algorithm, to speed up searching, SBKS is stored
as a hash tree, where units are hashed on the first two dimensions. Usually the
algorithm can finish within one iteration. When several iterations are necessary,
the algorithm is still efficient, as $m$ is typically very small ($m < 10$).

Example 3. Fig. 2 plots the SBKS of Table 1 with the representative class for
each unit. Given an input $t$ with $score(t) = (16.8, 86.7)$, $t$'s label cannot be
determined from unit (3, 4), for $T_{(3,4)} = 0$ and $E((3, 4)) =$ nil. A larger subspace
centered at (3, 4), the square area with bold borders, is searched:
$S = [[2, 4], [3, 5]]$ = {(2,3), (2,4), (2,5), (3,3), (3,4), (3,5), (4,3), (4,4), (4,5)}.
From Table 1, $n_S(C_1) = 9$ and $n_S(C_2) = 13$, and thus $E(S) = C_2$; $t$ is labelled $C_2$.
SBKS can thus build complex classification models, even when the space is not
linearly separable.



Fig. 2. Classification decision from the SBKS of Table 1 (each unit is marked with its
representative class: C1, C2, or nil)



4 Experimental Results

Applying SBKS to CAEP and iCAEP produces CAEP# and iCAEP#, where
for CAEP# the SBKS is constructed from the original scores of CAEP before
normalization. Experiments were done on 28 UCI datasets with the following
settings. Classification accuracy was measured with 10-fold cross validation. For
ConsEP Miner, minsupp = 1% or a count of 5, whichever is larger, minrate = 5,
minrateimp = 0.01, and the EP set for a class is limited to 100,000 EPs. For
CAEP, the base score is chosen at 85%. For CAEP# and iCAEP#, SBKS was
built with 10 intervals for each score dimension.

Fig. 3 plots the accuracy ratio of CAEP#/CAEP and iCAEP#/iCAEP on 28
datasets. Points above the line at ratio 1.0 indicate that SBKS produces an
improvement. Several conclusions can be drawn from the figure. (1) SBKS
improves the accuracy of both CAEP and iCAEP, by an average of 1.6%
and 1.3% respectively. The accuracy ratio of CAEP#/CAEP ranges from
0.945 to 1.128, and that of iCAEP#/iCAEP from 0.949 to 1.119: while
SBKS produces accuracy improvements of up to 12.8%, it never significantly
degrades the performance of CAEP or iCAEP. (2) Compared with CAEP, CAEP#
improves accuracy on 17 datasets, ties on one and loses on 10; compared with
iCAEP, iCAEP# improves accuracy on 17 datasets and loses on 11. More
importantly, SBKS produces significant improvement on datasets where the
performance of CAEP and iCAEP is relatively poor. We also compared CAEP
classifiers with C5.0 [?]. The average accuracies of iCAEP#, CAEP# and C5.0 are
88.08%, 87.44% and 87.07% respectively; both iCAEP# and CAEP# outperform
C5.0. Further experiments are underway comparing the performance of CAEP
classifiers with that of C5.0 under boosting [?].

5 Conclusions 

We have presented a behaviour knowledge-based approach for making
classification decisions, and have evaluated the method on CAEP (Classification by
Aggregating Emerging Patterns). In the original CAEP classifiers, the aggregate
contribution of EPs for each class is quantified as scores, and the classification
decision is reached by comparing scores. With the Score Behaviour Knowledge Space
(SBKS), we record the behaviour of training data on scores, and thus can take into
account the varying features of EPs for each class. Experiments on 28 UCI datasets
show that SBKS improves the performance of CAEP classifiers and that the resulting
CAEP classifiers outperform C5.0, the most advanced decision tree classifier.



Fig. 3. Accuracy ratio: CAEP# vs. CAEP and iCAEP# vs. iCAEP
Abstract. Clustering is an important data exploration task. A prominent
clustering algorithm is agglomerative hierarchical clustering.
Roughly, in each iteration, it merges the closest pair of clusters. It was
first proposed in 1951, and since then there have been numerous
modifications. Among its good features, it offers a natural, simple, and
non-parametric grouping of similar objects and is capable of finding
clusters of different shapes, such as spherical and arbitrary. But large CPU
time and high memory requirements limit its use for large data. In this
paper we show that geometric metric (centroid, median, and minimum
variance) algorithms obey a 90-10 relationship, where roughly the first
90% of the iterations are spent on merging clusters with distance less than
10% of the maximum merging distance. This characteristic is exploited by
partially overlapping partitioning. Experiments and analyses show that
different types of existing algorithms benefit from it, with drastically
reduced CPU time and memory. Other contributions of this paper include
a comparison study of multi-dimensional vis-a-vis single-dimensional
partitioning, and analytical and experimental discussions on the setting of
parameters such as the number of partitions and the dimensions for partitioning.



1 Introduction 

Clustering is an important data exploration task. It is applied in different
areas including data mining. Surveys on clustering algorithms can be found in
[?]. Hierarchical clustering is one of the more prominent clustering algorithms.
It was first proposed in 1951 [?], and since then there have been numerous
modifications, including recent ones [?]. The fact that it has been a
prominent algorithm for the last half century shows that it has stood the test of
time. Among its various good features, it is a non-parametric (it assumes very little
in the way of data characteristics), natural and simple way of grouping objects,
and it is capable of finding clusters of different shapes, such as spherical and
arbitrary. But its large CPU time and high memory requirement make it unsuitable for
large data. Efficient techniques to handle large data can be roughly classified
as sampling, summarizing, and partitioning. Sampling has been used in several
algorithms and recently in CURE [?]. Basically, clustering is done over a small
sample and the results are extended to the whole data set. Although it is criticized
for missing out small clusters, sampling usually retains the underlying cluster
structure. Summarizing algorithms are based on the fact that data points that
are very close to each other can be merged and their summary information can
be used for efficient clustering [?]. Partitioning is used in [?] and in parallel
algorithms in [?], where data is partitioned, each partition is then clustered,
and later the results are consolidated to determine a global set of clusters. These
techniques help in reducing the size of the data while trying to retain the original
cluster structure, but they still need to run the traditional algorithms on the
reduced data: a sample, a summary, or a partition. Improving the traditional
algorithms will therefore have a complementary effect on these algorithms. This
paper focuses on reducing the time and memory requirement of hierarchical algorithms.



2 Background 

It is assumed that data contains $N$ data points with each point $x$ described
by an $M$-tuple $x = (x_1, ..., x_M)$ of real numbers, where $M$ is the number of
attributes or dimensions¹. Distance between two objects is usually calculated
using Minkowski metrics, where for fixed $d$ and for any two objects $x$ and $y$,
$L_d(x, y) = (\sum_{i=1}^{M} |x_i - y_i|^d)^{1/d}$. This family includes the Manhattan
metric $L_1$, the Euclidean metric $L_2$, and the Chebychev metric $L_\infty$. For ease of
explanation it is assumed that distance is measured in the Euclidean metric.
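
The Minkowski family described above can be written as a short Python function. This is only a sketch: the name `minkowski` is hypothetical, and the Chebychev case is handled separately as the largest coordinate difference, which is the limit of the general formula as d grows.

```python
import math
from typing import Sequence


def minkowski(x: Sequence[float], y: Sequence[float], d: float) -> float:
    """Minkowski distance L_d between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(d):
        return max(diffs)  # Chebychev metric, the limit as d grows
    return sum(v ** d for v in diffs) ** (1.0 / d)
```

For the points (0, 0) and (3, 4), d = 1 gives the Manhattan distance 7, d = 2 the Euclidean distance 5, and d = infinity the Chebychev distance 4.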

Hierarchical algorithms can be broadly classified as divisive and agglomera- 
tive. In divisive type, starting with one cluster that contains all data points, a 
cluster is divided in each iteration until each data point belongs to a distinct 
cluster. In agglomerative type, starting with each point in a distinct cluster, 
the closest pair of clusters are merged in each iteration until there remains only 
one cluster containing all data points. The output of hierarchical clustering is 
a dendrogram which is a hierarchy of divisions or agglomerations with the top 
level or root representing all points in one cluster, and the bottom level or 
leaves representing each point in a distinct cluster. The computation needed in
an agglomerative algorithm to go from one level to another is usually simpler
than in a divisive algorithm. In this paper we are concerned with agglomerative
algorithms. They can be classified into two categories depending on the type of
similarity measure used, namely graph and geometric metrics. Algorithms that
use graph metrics are single link, complete link, and average link, and those
using geometric metrics are centroid, median and minimum variance. A basic
difference between the two types is that in graph metrics each point is a
representative, while in geometric metrics each cluster has only one representative,
e.g. the centroid. This difference is important from the point of view of performance
and shape of clusters. For example, a graph metric algorithm (e.g. single link) can
discover arbitrarily shaped clusters, while a geometric metric algorithm (e.g. the
centroid method) is more suitable for clusters of spherical shape.

It is observed that for geometric metric algorithms, except for the last portion
of the iterations, the size (i.e. number of points) and the merging distance
of the closest pair of clusters are very small compared to their respective
maximums. It is further observed that initial iterations are typically much costlier
than those towards the end. See Figure 1 for plots showing these observations
pictorially. The plots are based on the results obtained for a 2-D data set
containing 3000 points in 100 clusters, each a Gaussian ball, distributed randomly
in the feature space with some noise. Each dimension has the range [0.0, 10.0].
This data is a simulated version of the data set DS2 reported in BIRCH [?]. All
experiments here and later in the paper are run on a Digital 8400 5/350 with a 350
MHz processor and 1 GByte of ECC main memory. The results are for the centroid
type algorithm. In Figure 1(A) the merging distance is plotted against iteration
number. Although the merging distance is not monotonically increasing, i.e. in a
later iteration it can be smaller than that in an earlier iteration, it exceeds
roughly 10% of the maximum merging distance only in roughly the last 10% of the
iterations. This behavior is seen for other geometric metric algorithms
and for different data including high-dimensional, multi-resolution, and skewed
distribution. But it may not be observed for graph metrics. The reason is that in a
geometric metric (e.g. centroid type), a cluster is represented by a single point
(i.e. the centroid); after merging the closest pair of clusters, the distance of the new
cluster from most of the other clusters (except for those clusters very close to
the bisecting hyper-plane between the two merging centroids) is larger than the
smaller of the two distances before merging, whereas in a graph metric (e.g. single
link) the distance after merging will be equal to the smaller of the two distances
before merging². We will call these observations the 90-10 relationship, which means
that roughly the first 90% of the iterations merge clusters that are separated
by less than roughly 10% of the maximum merging distance.

Fig. 1. Important observations for geometric metric algorithms: (A) distance plot,
iteration number vs. merging distance; (B) size plot, iteration number vs. size of the
larger cluster; (C) time plot, iteration number vs. time until the iteration

¹ Hierarchical clustering over binary and nominal data types is discussed in [?].

Centroid and median types of geometric metric algorithms merge the clusters
whose centroids are the closest. The centroid type is known as unweighted as it
treats each point in a cluster equally, whereas the median type is known as weighted
as it weights all clusters the same, so points in small clusters are weighted more
heavily than points in large clusters. Ward's minimum variance method
merges the clusters that result in the minimum change in square error [?]. These
algorithms observe the 90-10 relationship. In the rest of the paper the centroid method
will be discussed in detail except when otherwise specified. The first centroid
algorithm proposed uses a similarity matrix. In each iteration the similarity
matrix is searched to find the closest pair of clusters. Let us call this 'step 1'. For
$n$ ($n = N, ..., 2$) clusters in an iteration, it requires $O(n^2)$ time to search through
the similarity matrix, and for $N - 1$ iterations this step takes $O(N^3)$ time. After
merging the closest pair, the similarity matrix is updated by deleting the column
entries for the pair that merged and by creating a new row for the merged
cluster by determining its distance from the other clusters. Let us call this 'step 2'.
For $n$ clusters in an iteration this step takes $O(n)$ time to update, and so for
$N - 1$ iterations it takes $O(N^2)$ time. Memory required is $O(N^2)$ because of
the similarity matrix. This simple algorithm can be improved by maintaining a
nearest neighbor array that stores the nearest neighbor for each cluster. This way
step 1 only requires finding the minimum of the nearest neighbor distances from
the nearest neighbor array, in $O(n)$ time for $n$ clusters in any iteration. But step
2 requires $O(n^2)$ time in each iteration if a naive nearest neighbor algorithm of
$O(n)$ time complexity is used. In 1984, Day and Edelsbrunner [?] suggested two
ways to improve the time complexity of this algorithm: Type (1), obtain an
improved bound on the required number of nearest neighbor updates, i.e.
improve step 1; Type (2), obtain an improved bound on the time required for
each update, i.e. improve step 2. For type (1) they suggested using a heap-based
priority queue that requires $O(\log n)$ time to find the nearest neighbor, giving
an over-all time complexity of $O(N^2 \log N)$. For the type (2) algorithm, if there are
$\alpha$ clusters to be updated after each iteration then the over-all complexity
becomes $O(\alpha N^2)$. Using geometric preliminaries they proposed an upper
bound for $\alpha$ of $2(3^M - 2)$, where $M$ is the number of dimensions.

² Because of this characteristic the single link algorithm follows the reducibility property [?].
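
To make the centroid procedure and the 90-10 relationship concrete, here is a minimal Python sketch. It is not one of the algorithms analysed above: it keeps neither a similarity matrix nor priority queues, but simply recomputes all centroid distances in every iteration (cubic time overall), which is enough for small inputs. The names `centroid_merge_distances` and `share_below` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def centroid_merge_distances(points: Sequence[Sequence[float]]) -> List[float]:
    """Run naive centroid clustering and return the merging distance per iteration."""
    # Each cluster is stored as (centroid, number of points).
    clusters: List[Tuple[Tuple[float, ...], int]] = [
        (tuple(float(v) for v in p), 1) for p in points
    ]
    distances: List[float] = []
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = math.dist(clusters[i][0], clusters[j][0])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Unweighted centroid: every point of both clusters counts equally.
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        del clusters[j]  # j > i, so delete j first
        del clusters[i]
        clusters.append((merged, ni + nj))
        distances.append(dist)
    return distances


def share_below(distances: Sequence[float], fraction: float = 0.1) -> float:
    """Fraction of merges whose distance is below `fraction` of the maximum."""
    limit = fraction * max(distances)
    return sum(d < limit for d in distances) / len(distances)
```

On a data set with well separated clusters, `share_below(centroid_merge_distances(data))` gives the share of iterations that merge at less than 10% of the maximum merging distance, which is the quantity the 90-10 relationship refers to.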

Anderberg classified hierarchical algorithms into 'stored matrix' and 'stored
data' [?]. Stored matrix algorithms maintain a similarity matrix, whereas stored
data algorithms do not but instead calculate the similarities as required. A major
distinction is that stored matrix methods are preferred when memory is sufficient
to store the similarity matrix of size $O(N^2)$; otherwise stored data is the
way out. The type (1) algorithm of the previous paragraph, which uses priority
queues, is a stored matrix method, while the type (2) algorithm is more suitable as
stored data if $M$ is not large and if memory is not enough to store the $O(N^2)$
similarity matrix. Recent algorithms on hierarchical clustering [?] use priority
queues. The algorithms presented here improve both stored matrix and stored data
algorithms by reducing their CPU time significantly, and furthermore they reduce the
memory requirement substantially for the stored matrix algorithm.



3 Proposed Algorithms 

In the previous section we discussed the existence of a 90-10 relationship between
the stage of the algorithm and the merging distance. We also observed that initial
iterations are very costly. In this section we propose algorithms using partially
overlapping partitioning that exploit these properties. The following figure
pictorially shows a single-dimensional partially overlapping partitioning approach.
Data is divided into $p$ partitions or cells by dividing a dimension
range (in this case $A$) into $p$ smaller ranges. Each cell is sandwiched
between two $\delta$-regions, which are typically much smaller than the cell. The first
and the last cells have only one adjacent $\delta$-region. Henceforth, by definition, a
cell is inclusive of its adjacent $\delta$-regions. Note that every $\delta$-region is included
in two cells. The value of $\delta$ can be given as a percentage of the range of
the attribute or as an absolute value. Using this data structure we propose two
algorithms: 2-phase and nested. In the rest of the section we use the priority queue
algorithm for explanation, although complexity analysis is also done for other
algorithms.
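
A minimal Python sketch of how a value could be assigned to cells under single-dimensional partially overlapping partitioning. The function name `cells_for`, the equal-width cells, and the placement of each δ-region as a strip of width δ centred on an interior cell boundary are assumptions made for illustration; the text only states that every δ-region belongs to the two cells it separates. The sketch also assumes that δ is smaller than the cell width.

```python
from typing import List


def cells_for(x: float, lo: float, hi: float, p: int, delta: float) -> List[int]:
    """Return the indices (0 .. p-1) of all cells that contain the value x."""
    width = (hi - lo) / p
    home = min(int((x - lo) / width), p - 1)  # cell that x falls into
    cells = [home]
    left_edge = lo + home * width
    right_edge = left_edge + width
    # A value inside a delta-region also belongs to the neighbouring cell.
    if home > 0 and x - left_edge <= delta / 2:
        cells.insert(0, home - 1)
    if home < p - 1 and right_edge - x <= delta / 2:
        cells.append(home + 1)
    return cells
```

Under these assumptions, any two values that are less than δ apart share at least one cell, which is the property the partitioning relies on when only pairs closer than δ are merged.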




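[Figure: single-dimensional partially overlapping partitioning of attribute A into cells separated by δ-regions]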

2-Phase Algorithm. The basic idea is that instead of creating priority queues for
all data, there are now priority queues for each cell separately. In each
iteration the closest pair is found for each cell, and from those the over-all closest
pair is found. If the over-all closest pair has distance less than $\delta$ then they are
merged and the priority queues of only the corresponding cell are updated. If
any of the merging cluster centers or the merged cluster center is in a $\delta$-region
then the priority queue of the affected cell is also updated. This procedure is
repeated until the merging distance is larger than $\delta$. This is phase 1 of the
algorithm. In phase 2 the traditional priority queue algorithm is employed over the
remaining clusters to complete the dendrogram. Note that the same $\delta$ is used to
partition the data as well as to stop the algorithm. This way a large number of
small clusters are merged in phase 1, which uses partitioning, and only a
small number of larger clusters are merged in phase 2, which uses the traditional
algorithm. Partially overlapping partitions were employed earlier in [?] in order
to speed up nearest neighbor search. In a nearest neighbor algorithm $\delta$ is ideally
set to a value slightly larger than the minimum distance between all points so
that the actual nearest neighbor is located in the same cell as the candidate point
and optimum performance is obtained. But an ideal $\delta$ for clustering purposes
is the distance corresponding to the point at which the distance plot takes a sharp
up-turn, as shown in the distance plot of Figure 1(A). More on this is discussed later.
An efficient nearest neighbor algorithm cannot benefit the priority queue clustering
algorithm, where the nearest neighbor of each cluster is already at the top of each
queue. Instead the queues need to be maintained efficiently as the algorithm
progresses. Nearest neighbor algorithms gain in time but not in memory, whereas, as
will be evident soon, clustering algorithms gain significantly in both.

Table 1 exhibits the 2-phase partitioning algorithm using priority queues
with time and memory complexities. Steps 1-10 are phase 1 and steps 11-12 are
phase 2. Notations: $|\delta|$ is the number of points in a $\delta$-region, $p$ is the number
of cells, and $k'$ is the remaining number of clusters after phase 1. Complexity
analysis is done by assuming all cells to be of equal size.

Table 1. An example of the 2-phase partitioning approach over the priority queue algorithm

Algorithm                                           Time Complexity                            Memory Complexity
Input: Data (N, M), p, δ
Output: Dendrogram
1.  choose an attribute to partition the data       O(1)
2.  divide the data into p cells, where each        O(N)                                       O(N + p*|δ|)
    cell has two adjacent regions of width δ
    (except first and last cells)
3.  create priority queues P for each cell          p*O((N/p + |δ|)^2)                         p*O((N/p + |δ|)^2)
4.  while merging distance < δ
5.    for each cell find the closest pair of        p*O(N/p + |δ|)
      clusters, C1 and C2
6.    find the over-all closest pair                O(p)
7.    merge the over-all closest pair               O(1)
8.    update corresponding P                        O((N/p + |δ|) log(N/p + |δ|))
9.    if any of the pair or merged cluster is in    O((N/p + |δ|) log(N/p + |δ|))
      a δ-region then update P of affected cell
10. remove the duplicate copies of clusters         p*O(k')
    in δ-regions, if any
11. cluster the remaining k' clusters               O(k'^2 log k')                             O(k'^2)
12. return dendrogram
Over-All Complexity                                 (N - k') * O((N/p + |δ|) log(N/p + |δ|))   p*O((N/p + |δ|)^2)
                                                    + O(k'^2 log k')                           or O(k'^2)

Complexity Analysis: The 2-phase algorithm in Table 1 uses priority queues,
first proposed in [?]. The original algorithm has an over-all time complexity
$O(N^2 \log N)$ and memory complexity $O(N^2)$. On the other hand, the 2-phase
algorithm has an over-all time complexity
$(N - k') * O((N/p + |\delta|) \log(N/p + |\delta|)) + O(k'^2 \log k')$,
assuming $\log(N/p + |\delta|)$ to be greater than $p$. Memory complexity is
$p * O((N/p + |\delta|)^2)$ or $O(k'^2)$, whichever is larger. If $|\delta|$ and $k'$ are small
and if they are assumed negligible then the complexities simplify significantly, giving
time complexity $N * O((N/p) \log(N/p))$, i.e. $O((N^2/p) \log(N/p))$, and memory
complexity $p * O((N/p)^2)$, i.e. $O(N^2/p)$.

In Table 2 we give the time and memory complexities of 'stored matrix' and
'stored data' types of algorithms before and after simplification. Under the 'stored
matrix' category, in addition to the 'priority queue' type there is the 'similarity
matrix' algorithm, which uses a similarity matrix in place of priority queues.

Correctness: The overlapping partitioning data structure guarantees a correct
dendrogram. Note that the number of correct dendrograms can be more than one
due to the existence, if any, of ties between merging distances.



Table 2. Time and Memory Complexity Comparison (* simplification is done by
assuming |δ| and k' negligible; cells marked [unreadable] could not be recovered)

Time Complexity
Type of Algorithm                   Traditional     2-Phase, Before Simplification*              2-Phase, After Simplification*
Stored Matrix - Similarity Matrix   O(N^3)          [unreadable] + O(k'^2 log k')                [unreadable]
Stored Matrix - Priority Queues     O(N^2 log N)    (N - k') * O((N/p + |δ|) log(N/p + |δ|))     O((N^2/p) log(N/p))
                                                    + O(k'^2 log k')
Stored Data                         O(α*N^2)        (N - k') * O(α*(N/p + |δ|)) + O(α*k'^2)      O(α*N^2/p) if α > p,
                                                                                                 or [unreadable] if p > α

Memory Complexity
Type of Algorithm                   Traditional     2-Phase, Before Simplification*              2-Phase, After Simplification*
Stored Matrix - Similarity Matrix   O(N^2)          [unreadable]                                 O(N^2/p)
Stored Matrix - Priority Queues     O(N^2)          p*O((N/p + |δ|)^2)                           O(N^2/p)
Stored Data                         O(N)            p*O(N/p + |δ|)                               O(N)



Nested Algorithm. Time taken by a 2-phase algorithm depends on the total
time taken by phases 1 and 2. As phase 2 is the traditional algorithm, whose time
complexity directly depends on the remaining number of clusters ($k'$) after phase
1, for good performance phase 1 should take little time and $k'$ should be small



as well. Experiments show that it may not be easy to detect the point at which
the total time is the least. Experiments are conducted over the DS2 data set (used
earlier for the distance plots in Figure 1) with $N$ = 3k in 2-D. The number of cells is
fixed to 10 while $\delta$ varies from 0.25 to 9.0. Figure 2(A) shows the time taken by
phases 1 and 2 and their total time. We redraw the distance plot of Figure 1(A) with some



Fig. 2. Results and Analysis of Two-Phase Algorithm: (A) δ vs. phase 1, phase 2, and
total time (p = 10); (B) iteration number vs. merging distance; (C) a sample follows
the whole data

labels in Figure 2(B). Label b points to a proper value of $\delta$, and label a points to
the minimum merging distance. The reason why b is considered an ideal $\delta$ value
is that while on one hand it is very small, on the other hand it is able to capture
most of the iterations, leading to good performance. For $\delta$ smaller than b, phase
2 takes a long time as the remaining number of clusters is still large, whereas for $\delta$
larger than b, phase 1 takes more time than the optimum, as larger $\delta$ means larger
overlap between cells. So we see that b is an ideal value of $\delta$ for efficient
clustering. Contrast this with the ideal $\delta$ for the nearest neighbor search methods
discussed in [?] that employ partially overlapping partitioning. Hence, we
see that a is an ideal $\delta$ for nearest neighbor search while b is ideal for clustering.

It may not be easy to guess the value of b properly, though. The following approach
may be helpful. Take a random sample of points and obtain its distance plot.
Figure 2(C) shows that a sample follows the whole data: the b values for
the sample and the whole data almost match. The main reason is that a sample will
typically increase the smallest distance among points in a cluster but will retain
the larger distances among the clusters themselves. This is also the reason why
the sampling approach has been efficient for clustering. Manual observation can
find a proper $\delta$ value. A window function to detect the point at which the distance
increases significantly can lead to automatic detection.

We propose a nested algorithm that does not require a manual setting of
$\delta$. It tries to make parameter setting even easier while gaining in performance
in comparison to the 2-phase algorithm. It is similar to the 2-phase algorithm
except that it does not switch to phase 2 immediately. Instead it repeats phase
1 with a reduced number of clusters until a negligible number of clusters remain
(e.g. 2). The algorithm typically starts with a large value for $p$ and a small
value for $\delta$. With each iteration of phase 1, $p$ is gradually reduced while $\delta$
is increased. Arguably this is easier than setting $\delta$ to b from the start,
as in the 2-phase algorithm. Nested partitioning attempts to divide the difference
(b-a) (shown in Figure 2(B)) to gain in performance over the 2-phase algorithm.
More discussion on how to break the range (b-a) is given in the next section.

Complexity Analysis: Let us assume that the nested sequence is specified by
$\langle p_j, \delta_j \rangle$ for $j = 1...s$, where $p_j$ and $\delta_j$ are the $p$ and $\delta$ values in nested
iteration $j$, respectively, and $s$ is the number of iterations in the nested sequence.
Let $n_j$ be the number of clusters remaining after nested iteration $j - 1$. Initially $n_1$
is $N$ and finally $n_{s+1}$ is $k'$. The time complexity of the nested algorithm is given as
$\sum_{j=1}^{s} (n_j - n_{j+1}) \, O((n_j/p_j + |\delta_j|) \log(n_j/p_j + |\delta_j|)) + O(k'^2 \log k')$.
In this case $k'$ is made negligible by specifying a sufficient number of iterations in
the nested sequence, so that the last term $O(k'^2 \log k')$ can be discarded from the
complexity. The memory complexity of the nested algorithm is
$\max_{j=1...s} (p_j * O((n_j/p_j + |\delta_j|)^2))$, because
after each iteration of the nested sequence the priority queues are freed.

Overlapping Partitioning for Minimum Variance Ward's Method. Partially
overlapping partitions can be applied with some modification to Ward's
minimum variance method [?]. Like the centroid method, it represents a cluster by
the centroid. But unlike the centroid method, in each iteration the pair of clusters
with the least increase in the sum of square errors is merged. The square error for
cluster $k$ is the sum of squared distances to the centroid for all points in cluster
$k$. Mathematically, $e_k^2 = \sum_{i=1}^{n_k} \sum_{j=1}^{M} (x_{ij}^{(k)} - c_j^{(k)})^2$, where
$n_k$ is the number of points in cluster $k$, $x_{ij}^{(k)}$ is the $j$-th dimension value of
point $x_i$ of cluster $k$, $c_j^{(k)}$ is the $j$-th dimension value of the centroid of cluster
$k$, and $j = 1...M$. If clusters $r$ and $s$ merge in an iteration to create cluster $t$,
then the change in square errors is given as $\Delta E_{rs} = e_t^2 - e_r^2 - e_s^2$. By
replacing the expressions for square error and after simplification, we obtain
$\Delta E_{rs} = \frac{n_r n_s}{n_r + n_s} \sum_{j=1}^{M} (c_j^{(r)} - c_j^{(s)})^2$. Note that the second
term of the right hand side is the squared distance of the centroids of clusters $r$ and
$s$. The first term can take values less than, equal to, or greater than 1. It is less
than 1 when at least one of the clusters has size 1. This term makes direct application
of the partially overlapping partitioning unsuitable, although the method shows the
90-10 relationship. The reason is that there is no guarantee that the closest pair is
merged in each iteration, as they may be in different cells. But note that the minimum
value of this first term is $\frac{1}{2}$, for the case where the size of both merging
clusters $r$ and $s$ is 1. The minimum $\Delta E_{rs}$ possible with distance $\delta$ is
$\frac{1}{2}\delta^2$. Hence to guarantee that the two clusters merged belong to at least
one common partition the following condition must hold: the $\Delta E_{rs}$ of the closest
pair $r$ and $s$ must be less than $\frac{1}{2}\delta^2$,
so that the distance between them will be less than $\delta$.
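
The simplified expression for the increase in square error can be checked numerically. The Python sketch below (the function names `centroid`, `square_error`, and `merge_cost` are hypothetical) computes the increase both directly, as the square error of the merged cluster minus the square errors of the two clusters, and through the product of the size term and the squared centroid distance.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: List[Point]) -> Tuple[float, ...]:
    """Mean of the points, dimension by dimension."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def square_error(points: List[Point]) -> float:
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum((x - cj) ** 2 for p in points for x, cj in zip(p, c))


def merge_cost(r: List[Point], s: List[Point]) -> float:
    """Increase in square error when clusters r and s are merged."""
    n_r, n_s = len(r), len(s)
    gap = sum((a - b) ** 2 for a, b in zip(centroid(r), centroid(s)))
    return n_r * n_s / (n_r + n_s) * gap


r = [(0.0, 0.0), (2.0, 0.0)]
s = [(5.0, 4.0)]
direct = square_error(r + s) - square_error(r) - square_error(s)
```

For two singleton clusters at distance δ the size term is 1/2, so `merge_cost` returns δ²/2, which is the threshold used in the condition above.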

4 Performance 

In this section we discuss different performance issues, and then perform
experiments using the suggested parameter settings. Data sets used for experiments
include synthetic and benchmark data. The data sets are: DS2 used in
BIRCH [?]; t4.8k, t5.8k, t8.8k, t7.10k used in CHAMELEON [?] and CURE [?]; and
the SEQUOIA benchmark data used in DBSCAN [?]. We generated DS2 data
in different dimensions with uniform (DS2_U) and Gaussian (DS2_G) distribution
within clusters. The rationale behind experimenting over these data is to
compare the partitioning algorithm with the traditional algorithm for different
distributions of data and different shapes of clusters in varying dimension and noise.

Performance Issues. From the complexity analyses in the previous section,
it is apparent that the performance of the partitioning algorithm depends largely
on the number of cells $p$ and on $\delta$. Other important factors are (a) the number
($m$) and choice of dimensions to partition, and (b) how to change the $\delta$ and $p$
values in nested partitioning.

How Many Dimensions to Partition (m): It is affected by these factors: (1) the
larger the $m$, (possibly) the more uniform the distribution of points across cells,
leading to CPU time reduction; (2) given $p$ and $\delta$, the larger the $m$, the smaller the
number of $\delta$-regions, leading to less CPU time and memory. These two factors
indicate that larger $m$ is preferable. But this may not be true, as the larger
the $m$, the larger the number of cross-sections of $\delta$-regions. As the number of
cells that need updating when a point is in the cross-section of $m$ dimensions is
$2^m$, larger $m$ will require more updating when a point is in a cross-section. We
conducted experiments by running phase 1 only for different $m$ over the DS2_U and
DS2_G data sets, and for different $p$ and $\delta$. Our findings can be summarized
as follows: multi-dimensional outperforms single-dimensional partitioning; the
change in memory requirement is insignificant; for $m > 1$ the CPU time depends
on different factors, but there is no consistent trend and the times are
very close in general.

Choice of Dimensions can affect performance. Some of the factors affecting it
are variance, range and outliers. If the range is not affected by outliers, then
attributes with larger range and variance are usually preferred. Other possible
approaches include principal components analysis, where data is projected onto the
first $m$ principal components.

Setting δ & p for Nested Partitioning: In nested partitioning a sequence of $\delta$, $p$
values is given as the input. Experiments suggest initializing $\delta$ to a very small value
and then incrementing it gradually by small values until a negligible
number of clusters remain. The value of $p$ can be set by first setting a probable
number ($n'$) of clusters in each cell and then setting $p$ to $N/n'$. Experiments with
different data sets show that optimum performance is obtained for $n'$ between 5 and 20.

Experimental Results. Experiments are conducted on the data sets described
before. The nested algorithm is used with n' = 7 (i.e. p = N/7) and with m = 2
dimensions for partitioning, each with $\sqrt{p}$ divisions (rounded to an
integer); δ is initialized to 0.25% and is incremented by 0.25% for each nested
iteration. Experiments are conducted to find the speed-up and memory scale-down
for varying N, M, and number of clusters.
Varying N: For this experiment N varies from 0.5k to 3k. As the traditional
algorithm exceeds the memory allocated in our server, we could not test for N
larger than 3k. Figure 3(A) shows the speed-up of the nested multi-dimensional
partitioning algorithm over the traditional priority queue algorithm. The maximum
speed-up is 862 for t8.8k with N=3k, where the traditional algorithm took 4717.7
CPU-sec while the nested algorithm took only 5.47 CPU-sec. Note that the speed-up
factor increases with N. The reason is that the complexity of the j-th iteration
of the nested sequence is given as
$(n_j - n_{j+1})\,O((N/p + |\delta_j|)\log(N/p + |\delta_j|))$. Note that N/p is
set to 7 in our experiments, so the second term of the expression is mostly
indifferent to an increase in N except for the increase in $|\delta_j|$. In fact
the time difference for the partitioning algorithm on data set DS2_G between
N=0.5k and N=3k is less than 6 CPU-sec. But as an increase in N significantly
affects the CPU time of the traditional algorithm, the gain factor increases
with N. Variation in speed-up for different data sets can be attributed to
different distributions. Figure 3(B) shows the memory scale-down of the nested
method over the traditional one. For both algorithms the reported memory is the
maximum over the complete run of the program; the memory requirement of the
programs themselves is negligible. The maximum scale-down is 105 for the t8.8k
data set where, for N=3k, the traditional algorithm requires 385M and the nested
algorithm requires only 5.1M. The increase in scale-down factor with N is due to
the same reason as explained for the CPU time speed-up. The variation in the
scale-down factor among different data sets is larger than that for the CPU time
speed-up because the distribution has a direct effect on the memory requirement.

Fig. 3. Speed-up and Memory Scale-Down for 2-D data sets (A and B) and speed-up
for varying M (C). Panels: (A) Speed-up, varying N; (B) Memory scale-down,
varying N; (C) Speed-up, varying M.

Varying M: Experiments are conducted over the DS2_G data set with N=3k and M=2
to 20. The number of dimensions to partition, m, is set to 2 for all experiments.
As seen in Figure 3(C), the gain factor for CPU time reduces with increasing M
for the following reason. Increasing M does not affect the traditional algorithm
much, because the dominant time factor, maintaining the heap-based priority
queues under insertions and deletions, is not affected by increasing M. Hence
the small increase in time due to the larger M for the nested algorithm reduces
the gain factor. Memory complexity, on the other hand, is affected very little,
as priority queues are the dominant factor and they do not change with the number
of dimensions.

Varying Number of Clusters: DS2_G data with N=3k in 2-D is used. The number
of clusters varies from 5 to 100 with approximately equal sized clusters.
Experiments are conducted both with N fixed and with N varying. For varying N,
a reduction in the number of clusters decreases N, which reduces the speed-up
factor. This result is similar to the results in Figure 3(A). When N is fixed,
the speed-up factor also reduces with a decrease in the number of clusters,
because each cluster contains more points, leading to more points in each cell.
For this reason the memory requirement increases with a decrease in the number
of clusters.

Minimum Variance Ward's Method: The CPU time speed-up factor of the partitioning
algorithm over the traditional algorithm is measured for the minimum variance
measure. For the DS2_G data set with N = 3k and M = 2 the gain factor is 835; for
M = 10 the gain factor is 532; for M = 20 it is 460.

5 Conclusion and Future Work 

In this paper we studied an important behavior of geometric metric hierarchical
clustering algorithms: they show a 90-10 relationship, where roughly 90% of the
iterations from the beginning merge clusters that are separated by less than
roughly 10% of the maximum merging distance. We proposed two algorithms using
partially overlapping partitions that exploit this behavior. These algorithms are
suitable for both stored data and stored matrix types of algorithms. Complexity
analysis and experiments show a significant reduction in time and memory. The
gain factor increases with the number of points in the data. An increase in
dimensions reduces the gain factor, but the gain is still significant for
high-dimensional data. We introduced a modified method to apply this partitioning
technique to the minimum variance Ward's method.

Future work includes parallelizing this technique. Existing parallel algorithms
employ naive partitioning where all N priority queues are distributed among
various processors. But partially overlapping partitioning is foreseen to be very
suitable for parallelization, as a processor will be in charge of the priority
queues for clusters in only one or a few cells, which should reduce memory
and CPU time. Other future work includes testing the suitability of other more
sophisticated partitioning

Abstract. This paper presents a clustering method for nominal and
numerical data based on rough set theory. We represent relative simi- 
larity between objects as a weighted sum of two types of distances: the 
Hamming distance for nominal data and the Mahalanobis distance for 
numerical data. On assigning initial equivalence relations to every ob- 
ject, modification of slightly different equivalence relations is performed 
to suppress excessive generation of categories. The optimal clustering re- 
sult can be obtained by evaluating the cluster validity over all clusters 
generated with various values of similarity thresholds. After classifica- 
tion has been performed, features of each class are extracted based on 
the concept of value reduct. Experimental results on artificial data and 
amino acid data show that this method can deal well with both types of 
attributes. 



1 Introduction 

Recent databases store a large amount of data composed of both nominal and
numerical attributes. Clustering has been receiving considerable attention as
one of the most promising approaches for revealing underlying structure in such
databases. However, the well-known clustering methods K-means and Fuzzy
C-Means (FCM) have difficulty in handling nominal data, since they require the
distance between objects to be represented on a ratio scale. Although the
agglomerative hierarchical clustering method can deal with nominal data by
using relative similarity, it still has the problem that the clustering result
strongly depends on the order in which objects are processed.

This paper presents a rough set-based clustering method for data containing
both nominal and numerical attributes. Rough sets, proposed by Pawlak, have
been receiving considerable attention in the field of knowledge discovery since
they provide tools to mathematically treat roughness of knowledge. Rough
sets can easily handle nominal data since their basic properties are related to
the indiscernibility relation. In our method, we first form equivalence relations
among objects based on their relative similarity, and classify them into
categories according to the relations. Similarity between objects is determined
as a weighted sum of their Hamming and Mahalanobis distances. Similar equivalence
relations are then modified so that they represent the same, simpler knowledge,
which generates an adequate number of categories. The optimal clustering result
can be obtained by evaluating the cluster validity, defined using upper and lower
approximations of a cluster, over all clusters generated with various values of
similarity thresholds. After classification has been performed, features of each
class are extracted based on the concept of value reduct. Experimental results
on artificial data and amino acid data show that this method can deal well with
both types of attributes and produce good clustering results.

2 Clustering Method 

Clustering is performed according to indiscernibility relations defined on the 
basis of relative similarity between objects. The overall procedure is summarized 
as follows. 

Step 1) For every object, assign an initial equivalence relation using a
similarity threshold $Th_1$.

Step 2) Modify similar equivalence relations using a threshold $Th_2$.

Step 3) Iterate Steps 1-2 using various values of $Th_1$ and $Th_2$, and obtain
the best clustering result that yields maximum validity.

Step 4) Extract features of each class based on the concept of value reduct.

2.1 Initial Equivalence Relation 

The first procedure is to assign an initial equivalence relation to every object.
Let $U = \{x_1, x_2, \ldots, x_n\}$ be the entire set of objects we are interested
in. Each object has p attributes represented by nominal or numerical values.

[Definition 1] Equivalence relation
An equivalence relation $R_i$ for object $x_i$ is defined by

$R_i = \{\{x_j \mid s(x_i, x_j) > Th_1\}, \{x_j \mid \text{others}\}\}$, for all $j$ $(1 \le j \le n)$,

where $s(x_i, x_j)$ denotes similarity between objects $x_i$ and $x_j$, and $Th_1$
denotes a threshold value of similarity. Obviously,
$R_i = \{[x_i]_{R_i}, \overline{[x_i]_{R_i}}\}$,
$[x_i]_{R_i} \cap \overline{[x_i]_{R_i}} = \phi$, and
$[x_i]_{R_i} \cup \overline{[x_i]_{R_i}} = U$ hold. The equivalence relation $R_i$
classifies U into two categories: one containing objects similar to $x_i$ and
another containing objects dissimilar to $x_i$. When $s(x_i, x_j)$ is larger than
$Th_1$, object $x_j$ is considered to be indiscernible from $x_i$. Similarity
$s(x_i, x_j)$ is calculated as a weighted sum of the Hamming distance
$d_H(x_i, x_j)$ of nominal attributes and the Mahalanobis distance
$d_M(x_i, x_j)$ of numerical attributes as follows:

$s(x_i, x_j) = \frac{p_d}{p}\left(1 - \frac{d_H(x_i, x_j)}{p_d}\right) + \frac{p_c}{p}\left(1 - d_M(x_i, x_j)\right)$,

where $p_d$ and $p_c$ denote the numbers of nominal and numerical attributes,
respectively. □
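
A minimal sketch of the initial equivalence relation follows. It is not the authors' code: the weights $p_d/p$ and $p_c/p$, the function names, and the toy data are assumptions, and the numerical distance is a plain absolute difference on a value already scaled to [0, 1], used here as a stand-in for a normalized Mahalanobis distance.

```python
def hamming(a, b):
    """Number of nominal attributes on which two objects differ."""
    return sum(1 for u, v in zip(a, b) if u != v)

def similarity(nom_i, nom_j, d_num, p_c):
    """Weighted sum of nominal and numerical similarity.

    d_num stands in for the Mahalanobis distance and is assumed to be
    pre-scaled into [0, 1]; p_c is the number of numerical attributes.
    """
    p_d = len(nom_i)
    p = p_d + p_c
    nominal_part = (1 - hamming(nom_i, nom_j) / p_d) if p_d else 0.0
    return (p_d / p) * nominal_part + (p_c / p) * (1 - d_num)

def initial_relation(i, sim, th1):
    """Split object indices into those similar to object i and the rest."""
    n = len(sim)
    similar = {j for j in range(n) if sim[i][j] > th1}
    return similar, set(range(n)) - similar

# Hypothetical toy data: two nominal attributes and one numerical attribute.
nominal = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
numeric = [0.0, 0.1, 0.2, 1.0]  # already in [0, 1]
n = len(nominal)
sim = [[similarity(nominal[i], nominal[j], abs(numeric[i] - numeric[j]), 1)
        for j in range(n)] for i in range(n)]
print(initial_relation(0, sim, 0.5))  # ({0, 1, 2}, {3})
```

With the threshold 0.5, object 0 is grouped with objects 1 and 2, and object 3 falls into the "others" category of the relation.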
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2.2 Modification of Equivalence Relations 

Objects should be classified into the same category when most of the equivalence
relations commonly regard them as indiscernible. However, depending on the value
of $Th_1$, there could be some equivalence relations which classify these similar
objects into different categories. Such equivalence relations cause an
undesirable clustering result consisting of small categories. The following is
an example of such a case.

Let $U = \{x_1, \ldots, x_4\}$ be the entire set of objects and let
$R = \{R_1, \ldots, R_4\}$ be a set of equivalence relations over U. Suppose that
R classifies U as follows:

$U/R_1 = \{\{x_1, x_2, x_3\}, \{x_4\}\}$,
$U/R_2 = \{\{x_1, x_2, x_3\}, \{x_4\}\}$,
$U/R_3 = \{\{x_1, x_2, x_3, x_4\}\}$,
$U/R_4 = \{\{x_3, x_4\}, \{x_1, x_2\}\}$,
$U/IND(R) = \{\{x_1, x_2\}, \{x_3\}, \{x_4\}\}$.
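
The partition $U/IND(R)$ in this example can be checked with a short sketch: two objects are indiscernible under R when every relation places them in the same block. The function name `ind_partition` is hypothetical, objects are written as the integers 1 to 4, and $U/R_3$ is read here as a single block containing all four objects, which is an assumption about the garbled source.

```python
def ind_partition(universe, partitions):
    """Group objects that fall in the same block of every partition."""
    groups = {}
    for x in universe:
        # The signature records, for each relation, which block holds x.
        signature = tuple(
            next(k for k, block in enumerate(part) if x in block)
            for part in partitions
        )
        groups.setdefault(signature, set()).add(x)
    return sorted(sorted(g) for g in groups.values())

U = [1, 2, 3, 4]
R = [
    [{1, 2, 3}, {4}],   # U/R_1
    [{1, 2, 3}, {4}],   # U/R_2
    [{1, 2, 3, 4}],     # U/R_3 (assumed single block)
    [{3, 4}, {1, 2}],   # U/R_4
]
print(ind_partition(U, R))  # [[1, 2], [3], [4]]
```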

In this case, equivalence relation $R_4$ classifies objects $x_2$ and $x_3$ into
different categories although the other three relations classify them into the
same category. Consequently, three fine categories are obtained. To avoid
excessive generation of categories, we modify similar equivalence relations so
that they represent the same, simpler knowledge. First, we define the
subordination degree, $\gamma(R_i, R_j)$, of two equivalence relations
$(R_i, R_j)$ as follows.

[Definition 2] Subordination degree of equivalence relations 

l{R^,Ro) = <^7 , 

- _ f 1, if [xi]R^ n [xj]R. ^(j>\ 

\ 0, if [xi]R^ n [xj]R^ =(/'■/’ 

where #(Y) denotes the cardinality of a set Y. □

[Definition 3] Modification of equivalence relations

Let $R_i, R_j \in R$ be initial equivalence relations and let $R'_i, R'_j \in R'$
be equivalence relations after modification. For an initial equivalence relation
$R_i$, a modified equivalence relation $R'_i$ is defined as

$R'_i = \{\{x_j \mid x_j \in P_i\}, \{x_j \mid \text{others}\}\}$, for all $j$ $(1 \le j \le n)$,

where $P_i$ denotes a subset of objects represented by

$P_i = \bigcup_{1 \le j \le n} \{x_j \mid \gamma(R_i, R_j) > Th_2\}$.

The value $Th_2$ denotes the lower threshold value to regard $R_i$ and $R_j$ as
the same relations. □

2.3 Evaluation of Validity

Depending on the values of the two thresholds $Th_1$ and $Th_2$, a variety of
sets of equivalence relations can be obtained in the preceding steps. We then
evaluate the validity of their clustering results based on the following
criterion and obtain the best set of equivalence relations, which yields maximum
validity.

[Definition 4] Validity of clustering result

Let U denote the entire set of objects, R denote an initial set of equivalence
relations, and R' denote the modified set of R. Suppose that R' classifies U
into l categories, $U/IND(R') = \{C_1, C_2, \ldots, C_l\}$.

Validity of the clustering result, V(R'), obtained using R' is defined by

where $\#(C_k)$ denotes the number of objects in the k-th category, and
$\underline{R}C_k$ and $\overline{R}C_k$ denote the R-lower and R-upper
approximations of $C_k$ given below:

$\underline{R}C_k = \bigcup_{x_i \in C_k} \{[x_i]_{R_i} \mid [x_i]_{R_i} \subseteq C_k\}$,

$\overline{R}C_k = \bigcup \{[x_i]_{R_i} \mid [x_i]_{R_i} \cap C_k \neq \phi\}$.
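
The two approximations can be sketched as follows. This is an illustrative reading rather than the authors' code: `classes` maps each object to its class of similar objects $[x_i]_{R_i}$, the function name `lower_upper` is hypothetical, and the sample values are invented.

```python
def lower_upper(classes, cluster):
    """Return the lower and upper approximations of `cluster`.

    classes: dict mapping object x_i to the set [x_i]_{R_i}.
    Lower: union of classes of members of the cluster that lie inside it.
    Upper: union of all classes that intersect the cluster.
    """
    lower, upper = set(), set()
    for x, cls in classes.items():
        if x in cluster and cls <= cluster:
            lower |= cls
        if cls & cluster:
            upper |= cls
    return lower, upper

# Hypothetical classes for four objects.
classes = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {1, 2, 3, 4}, 4: {3, 4}}
print(lower_upper(classes, {1, 2, 3}))  # ({1, 2, 3}, {1, 2, 3, 4})
```

A cluster whose lower and upper approximations coincide is described exactly by the relations; the gap between them reflects objects whose membership is uncertain.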



2.4 Feature Extraction by Value Reduct 

After classification is performed, we examine features of classified objects
based on the concept of value reduct. Note that here we regard a 'value reduct'
as 'a set of attributes which are essential to specify an object' and define it
as follows.

[Definition 5] Reduct

Let R, P and Q denote a set of equivalence relations, a subset of R, and a proper
subset of P, respectively. Here we define a reduct of object $x_i$ as

$D = \{A_i(P) \mid [x_i]_P \subseteq [x_i]_R,\ [x_i]_Q \not\subseteq [x_i]_R\}$, for all $P \subseteq R$, $Q \subset P$,

where $A_i(P)$ denotes a set of attribute values of $x_i$ associated with a set of
relations P. □

3 Experimental Results 

The method was first applied to the BALLOON database. This database
contained 20 objects, each with 5 nominal attributes. Table 1 shows the
clustering result. Here, the row 'μ' denotes the rough membership degree of each
object to its corresponding class. Objects 1-8 certainly (μ = 1.0) belonged to
class 1, and objects 9-16 belonged to class 2 or 3, also with high membership
grades. Objects 17-20 were divided into four classes since merging these classes
into another class required excessive modification of equivalence relations.
Figure 1 shows the clustering result for numerical data generated using
Neyman-Scott's method. The data consisted of 98 objects. As shown in Figure 1,
the objects were clearly divided into the expected clusters.

Table 1. Clustering result on the BALLOON database.

Fig. 1. Clustering result on the numerical data.

As a practical application, this method was applied to the analysis of
partially mutated anti-lysozyme antibody HyHEL-10. The data contained 35
objects. Each object had 16 nominal attributes (selected amino-acid sequence)
and a numerical attribute (measured value; combining coefficient Ka). Relations
between amino acids and Ka were examined as follows: (1) Classify antibodies
according to Ka (Sects. 2.1-2.3). (2) Extract amino acid residues that
characterize each antibody (Sect. 2.4). (3) Evaluate relations between the
classification result and extracted features of antibodies. Table 2 shows the
clustering result. The reduct of each antibody is denoted by []. A remarkable
feature was found on antibody #23, in which all residues were marked as reduct.
This implies that antibody #23 is a base antibody for mutation. Class 11
contained antibodies #5, #9, #20 and #34, which lost affinity for the antigen
(represented by ND). In these antibodies, one of the residues in VH33, VH50,
VH98, VL92 had been mutated to Ala. Since other antibodies which had mutations
in these sites but to other amino acids (for example, #6 (VH33=Leu),
#8 (VH33=Trp), #10 (VH50=Leu) and #11 (VH50=Phe)) did not lose affinity,
mutation in these sites to Ala may be the reason for the reduced affinity of the
antibody to the antigen.

Table 2. Clustering result on the amino-acid data.

4 Conclusions

This paper has proposed a clustering method based on rough sets. In the
experiments on amino-acid data analysis, 35 objects were classified into 11
clusters with adequate steps of attribute values among the classes. This
indicates that integration of equivalence relations successfully suppressed
excessive generation of clusters. In addition, the results on artificial data
showed that this method can handle both nominal and numerical attributes and
produce good clustering results. It remains as future work to investigate the
behavior of the method when thresholds are independently assigned to each object.

Abstract. In this paper, quantization errors of individual variables
in the k-means quantization algorithm are investigated with respect to
scaling factors, variable dependency, and distribution characteristics.
It is observed that Z-norm standardization limits average quantization
errors per variable to unit range. Two measures, quantization quality
and effective number of quantization points, are proposed for evaluating
the goodness of quantization of individual variables. Both measures are
invariant with respect to scaling/variances of variables. By comparing
these measures between variables, a sense of the relative importance of
variables is gained.

Keywords: k-means, quantization, scaling, normalization, standardization


1 Introduction 

Unsupervised clustering algorithms are important tools in exploratory data
analysis. Because clustering criteria are usually based on some distance measure
between individual data vectors, they are highly sensitive to the scale, or
dispersion, of the variables. It is easy to come up with examples where the
clustering result can be considerably changed by a simple linear rescaling of
the variables. Therefore, apart from the case when the original values of
the variables are somehow meaningful with respect to each other, some kind of
rescaling or standardization procedure is normally recommended prior to the
clustering.

The most common standardization procedure is to treat all variables
independently and transform each to so-called Z-scores by subtracting the mean
and dividing by the standard deviation of each variable. Another widely used
method is to normalize the range of the variable to the unit interval. Other
nonlinear, multidimensional, and even local standardization operations are also
possible. However, what these more complex operations may gain in flexibility,
they lose in interpretative power.

In this paper, vector quantization of numerical data sets is investigated with 
respect to the quality of quantization of individual components. The aim is to 
derive easily understandable measures of quantization quality to be used as part 
of a data understanding framework.

2 Definitions

Let D be an n × d sized numerical matrix such that each row of the matrix
corresponds to one data sample $x_i$, and each column to one variable $v_j$.
Without loss of generality, let each $v_j$ be Z-normed to zero mean and unit
variance. Scaling D with a set of scaling factors $\{f_j\}$ is a simple
multiplication operation: $\tilde{v}_j = f_j v_j$, which results in a new data
set $\tilde{D}$. The variances of the new variables are equal to the squared
scaling factors: $\tilde{\sigma}_j^2 = f_j^2 \sigma_j^2 = f_j^2$.
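
The relation between the scaling factor and the resulting variance is easy to verify numerically. The sketch below is illustrative only: the sample values and the factor 3.0 are invented, and the population standard deviation is assumed for the Z-norm since the text does not say which estimator is used.

```python
from statistics import fmean, pstdev, pvariance

def z_norm(values):
    """Transform a variable to zero mean and unit (population) variance."""
    mu, sd = fmean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def scale(values, f):
    """Multiply a Z-normed variable by the scaling factor f."""
    return [f * v for v in values]

raw = [4.3, 5.1, 5.8, 6.4, 7.9]   # hypothetical measurements
z = z_norm(raw)
scaled = scale(z, 3.0)
# The variance of the Z-normed variable is 1 and that of the scaled one is f^2 = 9.
print(round(pvariance(z), 6), round(pvariance(scaled), 6))
```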

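The scaled data set $\tilde{D}$ is quantized (or clustered). In this paper, the
batch k-means algorithm is used for quantization. The algorithm finds a set of k
prototype vectors $\tilde{m}_i$ which minimize the representation error:

$E = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d} (\tilde{x}_{ij} - \tilde{m}_{b_i j})^2 = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d} f_j^2 (x_{ij} - m_{b_i j})^2 = \sum_{j=1}^{d} \tilde{E}_j$    (1)

where $b_i = \arg\min_l \|\tilde{x}_i - \tilde{m}_l\|$, $\|\cdot\|$ is the
euclidean distance metric, and $\tilde{E}_j$ is the average quantization error
of scaled variable $\tilde{v}_j$.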
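
The decomposition of the total error into per-variable errors can be sketched as below. This is a minimal illustration, assuming squared euclidean error averaged over the samples; the function names and the four-point data set are hypothetical, and the prototypes are given rather than fitted by k-means.

```python
def nearest(x, prototypes):
    """Index of the prototype closest to x in euclidean distance."""
    return min(range(len(prototypes)),
               key=lambda l: sum((a - b) ** 2 for a, b in zip(x, prototypes[l])))

def variable_errors(data, prototypes):
    """Average squared quantization error of each variable separately."""
    n, d = len(data), len(data[0])
    errors = [0.0] * d
    for x in data:
        m = prototypes[nearest(x, prototypes)]
        for j in range(d):
            errors[j] += (x[j] - m[j]) ** 2 / n
    return errors

data = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
prototypes = [(0.0, 1.0), (4.0, 1.0)]
E_j = variable_errors(data, prototypes)
print(E_j, sum(E_j))  # [0.0, 1.0] 1.0
```

Here both prototypes are spent on the first variable, so its error is zero and the whole representation error comes from the second variable.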

The variable-wise quantization errors $\tilde{E}_j$ can be represented as
functions of the effective number of quantization points $k_j$, i.e. the number
of quantization points which are needed to get the same error when only
$\tilde{v}_j$ is quantized. The k-means algorithm finds a local minimum of E in
the space defined by the $k_j$. The derivative of E gives insight into what
happens when the importance of a variable increases:

$\frac{\partial E}{\partial k_{j'}} = \sum_{j=1}^{d} \frac{\partial \tilde{E}_j}{\partial k_{j'}} = \sum_{j=1}^{d} \tilde{\sigma}_j^2 \frac{\partial E_j}{\partial k_j} \frac{\partial k_j}{\partial k_{j'}}$    (2)

This shows that the allocation of quantization points is dependent on three
factors: the variance of each variable $\tilde{\sigma}_j^2$, distribution
characteristics of the variable $\partial E_j / \partial k_j$, and dependencies
between variables $\partial k_j / \partial k_{j'}$. In general, since the total
supply of quantization points k is limited, the partial derivatives
$\partial k_j / \partial k_{j'}$, $j \neq j'$, are negative. On the other hand,
for those variables which are (highly) dependent on variable j',
$\partial k_j / \partial k_{j'}$ is positive, and thus increased $k_{j'}$
actually benefits them.

The function $\tilde{E}_j(k)$ can be estimated directly from data by making a
number of quantizations of each variable with varying values of k. This is a
relatively light operation compared to quantization of the whole data space. For
the uniform distribution, the function is also easy to derive analytically:
$\tilde{E}_j(k) = \tilde{\sigma}_j^2 k^{-2}$, in which case
$k_j = (\tilde{\sigma}_j^2 / \tilde{E}_j)^{1/2}$. A similar formula can also be used for other
continuous distributions, for example for gaussian distribution Ej{k) k, 
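
The uniform-distribution case can be worked out in a few lines. The derivation below is a sketch under the assumption that the k quantization points sit at the centres of k equal-width cells; the interval length L is a symbol introduced here for illustration.

```latex
% A variable uniform on an interval of length L has variance
\sigma^2 = \frac{L^2}{12}.
% With k equal cells of width L/k and one prototype at each cell centre,
% the value inside a cell is again uniform, now on a length L/k, so the
% mean squared quantization error is the variance of that narrower uniform:
E(k) = \frac{(L/k)^2}{12} = \frac{\sigma^2}{k^2}.
% Solving for k gives the effective number of quantization points
k = \left(\frac{\sigma^2}{E}\right)^{1/2}.
```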

For each variable-wise quantization error $\tilde{E}_j$, the minimum is reached
when $k_j = k$, and the maximum when $k_j = 1$. In the latter case the data is
quantized using just its mean, in which case $\tilde{E}_j = \tilde{\sigma}_j^2$.
Thus, the quantization errors $\tilde{E}_j$ are limited by:

$\tilde{E}_j(k) \le \tilde{E}_j \le \tilde{\sigma}_j^2$.    (3)

Since $\tilde{E}_j = f_j^2 E_j$ and $f_j^2 = \tilde{\sigma}_j^2$, the limits can
also be defined with respect to the original unscaled variables:

$E_j(k) \le E_j \le 1$.    (4)

The quantization quality of a variable can be estimated as how close $E_j$ is to
the minimum possible error $E_j(k)$:

$q_j = \frac{E_j - E_j(k)}{1 - E_j(k)} = \frac{\tilde{E}_j - \tilde{E}_j(k)}{\tilde{\sigma}_j^2 - \tilde{E}_j(k)}$    (5)

With sufficiently large k, $E_j(k) \approx 0$ and thus
$q_j \approx \tilde{E}_j / \tilde{\sigma}_j^2$. Since quantization quality is
intricately linked with variable importance (the more important the variable,
the better it is quantized), $q_j$ acts as a measure of variable importance.
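
Both measures reduce to one-line computations once the errors are known. The sketch below is illustrative: the function names are hypothetical, the variance defaults to 1 as for Z-normed variables, and the effective number of points uses the uniform-distribution formula, so it is only an approximation for other distributions.

```python
from math import sqrt

def quantization_quality(e_j, e_min, variance=1.0):
    """q_j: 0 when the error equals the minimum possible, 1 when it equals the variance."""
    return (e_j - e_min) / (variance - e_min)

def effective_points_uniform(e_j, variance=1.0):
    """k_j under the uniform-distribution assumption E(k) = variance / k^2."""
    return sqrt(variance / e_j)

# Hypothetical errors for one Z-normed variable.
print(quantization_quality(e_j=0.5, e_min=0.0))
print(effective_points_uniform(e_j=0.04))
```

A small $q_j$ together with a large $k_j$ indicates a variable that received many quantization points, which the text interprets as high importance.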



3 Experiments 

To get further insight into how the proposed methods work in practice, some tests
were made using artificial data sets. The number of data points was 1000, and
they were quantized using 100 quantization points. To ensure good quantization,
ten k-means runs with 50 epochs were made and the best one was utilized. The
data points were from four different kinds of distributions: gaussian, uniform,
exponential and "2-spikes", which was formed as a mixture of two supergaussian
distributions with equal prior probabilities.

Figure 1 studies the effect of scaling factors in a 10-dimensional case. Instead
of a steady increase in the quantization error $E_j$ of the scaled variable,
which might be expected, there is a transfer area for scaling factors in the
range [1,10], where the error of the scaled variable is about equal to the
quantization error of all the other variables. The behaviour is due to the
limits imposed on $E_j$ by the possible values of $k_j$. With small scaling
factors, $k_1 \approx 1$, while for large scaling factors $k_1 \approx 100$.

The 2-spikes distribution (Figure 1, top right corner) has a sudden decrease
in quantization error for $1 < f_1 < 5$. This decrease shows the effect of
increasing $k_1$ over the threshold of 2: at this point, the quantization points
are divided into two groups, one for either spike of the variable.

Both importance measures qj and kj work very well in all cases showing how 
the importance of the first variable increases with increasing scaling factor in the 
significant range of scaling factors, and levels off when the scaling factor does 
not really matter. 

Figure 2 studies the effect of variable dependencies. In the test,
10-dimensional data sets with 1 to 9 identical variables were quantized. The
quantization errors behave exactly as if there were actually 10 to 2 variables,
respectively: the sum of the errors of the dependent variables is equal to the
error of each of the independent variables.

As an example of the usage of the proposed indicators, the IRIS data set
was quantized using 15 prototypes, see Table 1. Both measures $q_j$ and $k_j$
indicate that the petal length variable was the most important variable in this
quantization, although its descaled quantization error (Δ column) is the biggest
of the four variables.

Fig. 1. Quality of quantization. $E_j$ on the top row, $q_j$ in the middle, and
$k_j$ on the bottom. The data consists of 10 similar variables (gaussians on the
left, uniform and exponential in the middle, and 2-spikes distributions on the
right), the first of which is scaled by a factor of $f_1 \in [0.1, 316]$. The
solid line corresponds to the first variable, and the separate markers
(actually: boxplots) to the other 9 variables. Note that all axes are
logarithmic.

Table 2 shows results for a modified version of the IRIS data set. Four new
variables have been added: one discrete variable which indicates the class
information of the sample (values 1, 2 and 3 for the three different Iris
subspecies) and three random uniformly distributed variables. The random
variables are clearly the least important.



Table 1. Quantization of the IRIS data set with 15 prototypes. The min, max and Δ
values are in the original value range in order to facilitate interpretation by
domain experts.

Variable        [min, max]    Δ       q_j     k_j
sepal length    [4.3, 7.9]    ±0.2    0.063   5.07
sepal width     [2.0, 4.4]    ±0.1    0.096   4.29
petal length    [1.0, 6.9]    ±0.3    0.018   5.62
petal width     [0.1, 2.5]    ±0.1    0.033   3.87


Importance of Individual Variables in the k-Means Algorithm




Fig. 2. Ratio between quantization errors $E_j$ of the first and 10th variables
(out of 10) when the first 1 to 9 variables were identical: $E_{10} = cE_1$,
where c is the number of copies.

Table 2. Quantization of the augmented IRIS data set with 20 prototypes.

Variable        [min, max]    Δ       q_j     k_j
sepal length    [4.3, 7.9]    ±0.4    0.2     2.69
sepal width     [2.0, 4.4]    ±0.3    0.33    2.29
petal length    [1.0, 6.9]    ±0.3    0.036   3.60
petal width     [0.1, 2.5]    ±0.2    0.065   2.94
iris species    [1.0, 3.0]    ±0.2    0.062   2.75
random 1        [0.0, 1.0]    ±0.1    0.23    2.06
random 2        [0.0, 1.0]    ±0.2    0.33    1.86
random 3        [0.0, 1.0]    ±0.2    0.3     1.89



4 Discussion 

Various studies have investigated and compared different kinds of
standardization/scaling methods in clustering problems. For example, several
standardization procedures have been compared to each other in an artificial
clustering problem. Standardization based on range was often found to be
superior to the Z-norm standardization. This is understandable since clustered
distributions, for example discrete variables, retain more of their variance
than continuous variables in scaling by range. Thus they have, in view of the
results in this paper, bigger inherent scaling factors. However, Z-norm provides
a more uniform starting point in quantization, since the maximum quantization
errors are equal for all variables.

The importance of a variable can be viewed as the gain (decrease in the
quantization error) that the quantization algorithm achieves through increasing
the effective number of quantization points $k_j$ of some variables, and
(therefore) decreasing $k_j$ of the others. The allocation of $k_j$ seems to
depend primarily on three factors: scaling of the variables, their distribution
characteristics, and their dependency on the other variables (see Eq. 2). Of
these, scaling has a quite straightforward effect, and the effect of
distribution characteristics can be assessed by



J. Vesanto



calculating the 1-dimensional quantization errors for a range of k-values. The
third factor is the most problematic, and also the most interesting, because
it appears to allow a way to investigate variable dependencies through vector
quantization.

Variable importance and quantization quality are important pieces of
information when analysing and interpreting a quantization or a clustering
result. The final quantization error of a variable, even when compared to the
errors of the other variables, does not by itself give a very clear picture of
the quantization quality of the variable. In this paper, two measures $q_j$ and
$k_j$ have been proposed which are well suited for evaluating the quantization
quality of single variables.

Abstract. Current clustering methods suffer from the following problems: 1) high
I/O cost and expensive maintenance; 2) the need to pre-specify the uncertain
parameter k; 3) poor efficiency in treating arbitrary shapes in very large data
set environments. In this paper, we present a hybrid clustering algorithm to
solve these problems. It combines both distance and density strategies, and
makes full use of statistical information while keeping good cluster quality.
The experimental results show that our algorithm outperforms other popular
algorithms in terms of efficiency and cost, and achieves even greater speedup as
the data size scales up.



1 Introduction 

Current clustering algorithms often consider a single criterion and follow one
fixed strategy. Because of such limitations, these methods have advantages in
some aspects but are weak in others. Moreover, in very large databases the
already existing information is not fully utilized. Another problem is the
pre-specified k, which is unreasonable to fix before the clustering result is
known. We think that the following requirements for clustering algorithms are
necessary: to achieve good time efficiency on very large datasets, to identify
arbitrarily shaped clusters, to remove noise or outliers effectively, and to
cluster without any pre-specified k.

In this paper, a new clustering algorithm is proposed. It works in a hierarchical
framework and uses a hybrid criterion based on both the distances between
clusters and the density within each cluster. This hybrid method can easily
identify arbitrarily shaped clusters and can be scaled up to very large
databases efficiently.

The rest of the paper is organized as follows. We first review related work in
Section 1.1. In Section 2, a new clustering algorithm is presented. Section 3
discusses its enhancement behavior. In Section 4, we show the experimental
evaluation of the algorithm. Finally, in Section 5, concluding remarks are
offered.

1.1 Related Work

An agglomerative algorithm for hierarchical clustering starts with the separate set of 
clusters, which is each data point under initial case. Pairs of sub-clusters are then suc- 
cessively merged until the distance between clusters satisfies the minimum require- 
ments. CURE [1], a hierarchical algorithm, uses multi-representative data points in 
order to control arbitrary shape well. 

DBSCAN [2] relies on a density-based notion of clustering and is designed to discover
clusters of arbitrary shape. DBSCAN uses an R*-tree to achieve better performance.
However, when the data size is very large, DBSCAN needs frequent I/O swaps to load
data into memory. It also requires choosing a cut point whose optimal value is
uncertain.

Wang et al. proposed STING [3], which divides the spatial data into rectangular cells
using a hierarchical structure, with statistical information stored in each cell.
However, although high efficiency is obtained, a parent cell in the hierarchy may not
be built up correctly because it is derived from statistical summaries. As the
agglomerative procedure moves on, the cells can no longer represent the precise
information they originally held.



2 Hybrid Clustering Algorithm Based on Distance and Density 



2.1 Hybrid Clustering Algorithm 

The hybrid algorithm needs three parameters: M-DISTANCE, M-DENSITY and M-DIAMETER.
M-DIAMETER will be introduced later.

Definition 1: M-DISTANCE is the minimum distance between two clusters.
Definition 2: M-DENSITY is the minimum value of the density, where the density is the
number of data points in a cell belonging to the corresponding cluster.

The main clustering algorithm starts from the original sub-clusters (units or data
points). It is detailed below:

CLUSTERING (M-DISTANCE, M-DENSITY)
{ sort the sub-clusters in a heap;
  for each sub-cluster i with minimum distance between i and i.closest
  { if (distance (i, i.closest) < M-DISTANCE)
      merge (i, i.closest);
    else if (CONNECTIVITY (i, i.closest, M-DENSITY) == TRUE)
      merge (i, i.closest);
    else
      note that i and i.closest are not connected; } }

The procedure takes the sub-cluster with the minimum distance to its closest
sub-cluster. If this distance is smaller than M-DISTANCE, the two sub-clusters must
belong to one cluster. It can also happen that some sub-clusters should be merged even
though the distance between them is large. For such sub-clusters we test the
connectivity. Two connected sub-clusters must belong to one cluster. If two
sub-clusters cannot be connected to each other, they must belong to two different
clusters unless the distance between them is smaller than M-DISTANCE.

We make use of statistical information to test the connectivity of sub-clusters.
Definition 3: A cluster’s diameter is the maximum distance between two data points in it.
Definition 4: M-DIAMETER is the minimum diameter a cluster may have.

Definition 5: A cell is a small data grid whose diagonal is shorter than
min {1/2*M-DISTANCE, M-DIAMETER}.

Some properties of cells can then be stated. Owing to lack of space, the proofs are
omitted here.

Theorem 1: No cell contains data points belonging to two different clusters, and no
cell contains a whole cluster.
Theorem 2: If cell i belongs to cluster A, cell j is a neighbor cell of i, and
density (j) > M-DENSITY, then cell j must belong to cluster A too.

Definition 6: Noise consists of the data points in cells whose density is smaller than
M-DENSITY.

Finally, the following procedure is to judge the connectivity of two sub-clusters. 

CONNECTIVITY (cluster_i, cluster_j, M-DENSITY)
{ QUEUE q;
  for each cell k in cluster_i
    q.ADD (k);
  for each neighbor cell l of cells in q AND l not in q
  { if (density (l) > M-DENSITY)
      if (l.belongto == cluster_j)
        return TRUE;
      else
      { q.ADD (l);
        merge (cluster_i, l.belongto);
        for each cell m in l.belongto
          q.ADD (m); } }
  return FALSE; }
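
The connectivity test can be read as a breadth-first search over grid cells. The sketch below is only an illustration of that reading, not the authors' implementation: the dictionaries `density` and `belongs`, the helper `neighbors`, and the choice that neighbor cells are those sharing a face or a corner are all assumptions made for the example.

```python
from collections import deque
from itertools import product


def neighbors(cell):
    """Yield grid cells adjacent to `cell` (assumed: sharing a face or a corner)."""
    for offset in product((-1, 0, 1), repeat=len(cell)):
        if any(offset):
            yield tuple(c + o for c, o in zip(cell, offset))


def connectivity(cluster_i, cluster_j, density, belongs, m_density):
    """Return True if cluster_i reaches cluster_j through dense neighbor cells.

    density: dict mapping cell -> number of data points in the cell
    belongs: dict mapping cell -> sub-cluster id; it is mutated, because
             sub-clusters met on the way are absorbed into cluster_i,
             mirroring the merge step of the pseudocode.
    """
    queue = deque(c for c, owner in belongs.items() if owner == cluster_i)
    seen = set(queue)
    while queue:
        cell = queue.popleft()
        for nb in neighbors(cell):
            if nb in seen or density.get(nb, 0) <= m_density:
                continue
            owner = belongs.get(nb)
            if owner == cluster_j:
                return True
            # Absorb the dense neighbor and every cell of its sub-cluster.
            absorbed = [nb] + [c for c, o in belongs.items()
                               if owner is not None and o == owner and c != nb]
            for c in absorbed:
                belongs[c] = cluster_i
                if c not in seen:
                    seen.add(c)
                    queue.append(c)
    return False
```

Because the search only crosses cells whose density exceeds M-DENSITY, a gap of sparse cells between two sub-clusters makes the function return False, which corresponds to the "not connected" branch of CLUSTERING.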



2.2 Scale Up to Very Large Databases 

To handle very large databases, the algorithm constructs units in place of the original
sub-clusters while using sampling. To obtain these units, the data points are first
partitioned into cells. Some statistical information is also obtained in each
dimension of every cell. We then test whether a cell forms a unit. A unit is defined
as follows:

Definition 7: A unit is a cell whose data points belong to a certain cluster. 

The density of units is higher than that of other cells. Therefore, we determine
whether a cell is a unit by its density. If the density of a cell is at least M (M ≥ 1)
times the average density of all the data points, we regard the cell as a unit. Some
cells with low density may in fact be parts of final clusters, yet they are not
included in units. We treat the data points in such cells as separate sub-clusters.
This method greatly reduces the time complexity, as will be shown in the experiments.
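
As a rough sketch of the density test for units, the snippet below bins points into grid cells and keeps the cells that are dense enough. It assumes that "average density" means the mean point count over non-empty cells, and the names `find_units`, `cell_size` and `m` are hypothetical; the paper does not fix these details.

```python
from collections import Counter


def find_units(points, cell_size, m):
    """Return (units, counts): the dense cells and the point count of every cell.

    Assumption: the average density is the mean count over non-empty cells,
    and `points` is non-empty.
    """
    counts = Counter(tuple(int(x // cell_size) for x in p) for p in points)
    average = sum(counts.values()) / len(counts)
    units = {cell for cell, n in counts.items() if n >= m * average}
    return units, counts
```

Cells that fail the test are not discarded here; their points would be handed to the main algorithm as separate sub-clusters, as described above.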



2.3 Labeling Data in Databases 

The clusters identified by this algorithm are denoted by their representative points.
In most cases, the user needs to know detailed information about the clusters and the
data points they include. For this we need not test, for every data point, which
cluster it belongs to. Instead, we only need to find the cluster to which a cell
belongs; all data points in the cell then belong to that cluster. This also improves
the speed of the whole clustering process.
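
A minimal sketch of cell-based labeling follows, assuming a dictionary `cell_cluster` that maps each grid cell to its final cluster id. That mapping, the function name and the `noise_label` default are illustrative assumptions, not part of the paper.

```python
def label_points(points, cell_size, cell_cluster, noise_label=-1):
    """Label each point with the cluster of the grid cell that contains it."""
    labels = []
    for p in points:
        cell = tuple(int(x // cell_size) for x in p)
        # Points in cells without a cluster are given the noise label.
        labels.append(cell_cluster.get(cell, noise_label))
    return labels
```

Each point costs one dictionary lookup, so labeling is linear in the number of points and independent of the number of clusters.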



2.4 Time and Space Complexity 

The time complexity of our clustering algorithm is $O(m^2 \log m)$ in the worst case.
Here, m is the number of sub-clusters at the beginning. In general, when the data are
distributed in evenly proportioned dense areas, m will be small. Because the cells and
data are stored in linear space, the space complexity of our method is O(n).



3 Enhancements for Different Data Environment 



3.1 Handling Noises and Outliers 

Noise is random disturbance that reduces the clarity of clusters. In our algorithm,
noise is easily found and wiped out by locating the cells with very low density and
eliminating the data points in them. This reduces the influence of noise on both
efficiency and running time.

Unlike noise, outliers are not evenly distributed. Outliers are data points that lie
away from the clusters and have a smaller scale than clusters, so outliers will not be
merged into any cluster. When the algorithm finishes, the sub-clusters that have a
rather small scale are outliers. Our algorithm determines this scale by the parameter
M-DIAMETER, which denotes the minimum diameter a cluster may have.



3.2 Handling Data Sets Having Clusters with Arbitrary Shape 

By using the multi-representative technique and the distance-plus-density strategy, the
hybrid algorithm can accurately identify most arbitrarily shaped clusters, which are
difficult for other methods to process, as shown in Fig. 1.

Fig. 1. Data sets with clusters in dumb shape


4 Performance Evaluation 

The experiments were run in the following environment: Microsoft Windows NT 4.0,
Intel Pentium II 350 x 2, and 512 MB of RAM. We study the performance of our hybrid
clustering algorithm to assess its effectiveness and efficiency for clustering compared
with DBSCAN and CURE.

Fig. 2 illustrates the performance of the hybrid clustering algorithm and CURE as the
sample size grows from 1000 to 6000. It shows that our algorithm far outperforms CURE
while keeping good clustering quality.

The hybrid clustering algorithm can successfully handle an arbitrarily large number of
data points. Fig. 3 illustrates the performance of our algorithm, DBSCAN, and CURE as
the data size grows from 30,000 to 1,000,000.



5 Conclusions 

In this paper, we present a hybrid clustering algorithm. This algorithm identifies
clusters by both the distance between clusters and the density within clusters. The
algorithm can easily identify arbitrarily shaped clusters with good quality, and the
user need not pre-specify the number of clusters. With the help of statistical
information, it greatly reduces the computational cost of the clustering process. Our
experimental results demonstrate that it can outperform other popular clustering
methods.



Abstract. A data warehouse is an information provider that collects necessary
data from individual source databases to support the analytical processing of
decision-support functions. Recently, research on indexing technologies for
data warehousing has been proposed to support efficient on-line analytical
processing (OLAP). In the past decades, some novel indexing technologies for data
warehousing were proposed to retrieve information precisely. However, the
concept of similarity indexing in increasingly large data warehouses has seldom
been discussed. In this paper, the performance issue of approximate indexing in
data warehousing is discussed, and a new similarity indexing method, called the
bit-wise indexing method, together with the corresponding efficient algorithms,
is proposed for retrieving the similar cases of a case-based reasoning system
that uses a data warehouse as its storage space. Experiments comparing its
performance with two other methods show the efficiency of the proposed method.



1 Introduction 

A data warehouse is an information provider that collects necessary data from
individual source databases to support the analytical processing of decision-support
functions [14]. Recently, research on indexing technologies for data warehouses has
been proposed to support efficient on-line analytical processing (OLAP). A critical
performance issue for data warehousing is retrieving the necessary tuples according to
the query statements. Many researchers have proposed useful indexing technologies to
retrieve records precisely.

Case-based reasoning (CBR) is a problem-solving methodology in AI [3]. Just like a
human being, CBR uses useful prior cases to find suitable solutions for new problems.
It has been successfully applied in many areas [2][4][8][10][13]. The major tasks of
CBR can be divided into five phases: Case Representation, Indexing, Matching,
Adaptation and Storage. A critical task of CBR is to retrieve similar prior cases
accurately, and many researchers have proposed useful technologies to handle this
problem [7][11][14]. However, the performance of retrieving similar cases has seldom
been discussed. When the number of cases in the case base is large, the processing
time for retrieving similar cases increases rapidly, and retrieval becomes an overhead
of CBR. The retrieval time should therefore be taken into consideration in a
successful CBR system. In this work, we propose a novel indexing method with a
suitable similarity-measuring function. The corresponding case-retrieving algorithm
for a CBR system that uses a data warehouse as its storage space is also proposed to
accelerate case indexing and retrieval. The new indexing method and the corresponding
algorithm are easy to parallelize, which substantially improves the performance of
case retrieval and similarity measurement in the data warehouse. Finally, experiments
and a comparison with the traditional relational database and the bitmap index method
of data warehousing have been carried out. The results show the correctness and
efficiency of the proposed method.



2 Related Works 

In this section, some related topics are discussed, including data warehousing,
case-based reasoning and the bitmap indexing method.

The concept of data warehousing was first proposed by Inmon [14] in 1993. A data 
warehouse contains information collected from individual data sources and integrated 
into a common repository for efficient querying and analysis. When the data sources 
are distributed over several locations, a data warehouse is responsible for collecting 
the necessary data and saving it in appropriate forms. 

Case-based reasoning (CBR) is a problem-solving methodology in AI. CBR reuses past
cases to solve new problems. The success of a CBR system mainly depends on effective
retrieval of cases similar to the problem; therefore, indexing and matching become
important tasks in CBR [6][7][11]. In the similarity-based matching function, the
weights provide a surrogate method of representing the complex interrelationships in
similarity measurement, and represent the degree of importance of a feature toward the
goal of solving the problem. A critical task of CBR is to retrieve similar cases
accurately, and many researchers have proposed useful technologies to handle this
problem [5]. However, the performance of retrieving similar cases has seldom been
discussed. Retrieving similar cases needs more time when the matching function becomes
more complex or the number of cases in the case base becomes very large. Therefore,
how to retrieve similar cases quickly becomes an important issue in CBR, and the
retrieval time should be taken into consideration in a successful CBR system. Since
retrieving cases in attribute-based CBR is similar to retrieving records in data
warehouses (DWs), bitmap indexing technology seems directly applicable to the indexing
and retrieval phases of CBR [9]. However, some problems still need to be solved:

1. Prior cases are most likely not the same as the new case in CBR. Rather than the
exact matching that can be used to retrieve records in DWs, partial matching may be
required to retrieve prior cases in CBR.

2. Some extra computation of the similarity between the new case and prior cases is
required.

In other words, it is unsuitable to apply the bitmap indexing method directly to the
indexing phase of CBR. Some adaptation is needed, and we will discuss the details of
our new indexing method.



3 Bit-Wise Indexing CBR 

3.1 Architecture of Bit-Wise Indexing CBR 

As described above, we propose a novel indexing method with a suitable
similarity-measuring function to speed up the retrieval of similar cases in CBR. The
architecture of Bit-Wise Indexing CBR (BWI-CBR) is shown in Fig. 1. Most parts of
BWI-CBR are the same as those of general CBR, except the following:




Fig. 1. The architecture of BWI-CBR 

1. Bit-wise indexing phase: the indexing method of traditional CBR is replaced with the
bit-wise indexing method, which can greatly shorten the retrieval time in the
Matching phase.

2. Matching phase:
a. Retrieving relevant cases phase: By matching the bit-wise indexes of the new case
and the prior cases, we can select the relevant prior cases and filter out the
irrelevant ones. Moreover, the matching result of bit-wise indexing can be used
directly to calculate the similarity degree of prior cases in the Similarity
measurement phase.
b. Similarity measurement phase: The similarity between the relevant cases and the
new arrival case is computed to find the similar cases in the case base. These
similarities cannot be computed without knowing which attributes have the same
values. To solve this problem, we use the Mask Vector. We can also pre-compute all
possible similarities and construct the Similarity Mapping List. The similarity
between each prior case and the new arrival case can then be found quickly by
looking it up in the Similarity Mapping List, which greatly reduces the computing
overhead.

In our approach, we first use the bit-wise indexing method to replace the traditional
indexing method in CBR. Bit-wise operations can then be used to select the relevant
prior cases in the retrieving relevant cases phase. In this way, the irrelevant cases
are filtered out quickly, and the number of prior cases whose similarity to the new
case must be computed is reduced. Therefore, the similarities between relevant prior
cases and the new case can be measured quickly in the similarity measurement phase.



3.2 Some Definitions of Bit-Wise Indexing CBR 

Assume there is a set of cases, called examples, denoted as
$C = \langle C_1, C_2, \ldots, C_i, \ldots \rangle$ where $i \ge 1$, that needs to be
stored in the CBR in a specific domain, denoted as DOM, for reasoning. Assume A is the
set of attributes and all cases of C in the domain DOM can be abstracted into d
attributes, denoted as $A = \langle A_1, A_2, \ldots, A_d \rangle$. For each attribute
$A_i$ of case $C_j$, its attribute value is denoted as $V_i(j)$ and
$V_i(j) \ne null$. Moreover, denote by
$V(j) = \langle V_1(j), V_2(j), \ldots, V_d(j) \rangle$ the attribute value set of
case $C_j$. Denote $V_i = \langle V_{i1}, V_{i2}, \ldots, V_{i\alpha(i)} \rangle$,
where $V_{ij}$ is a possible attribute value of $A_i$, that is, $V_{ij} = V_i(x)$ for
some $C_x \in C$, and $V_{ij} \ne V_{ik}$ for $j \ne k$. $\alpha(i)$ is the number of
values in $A_i$, and $V_i$, the collection of all attribute values of attribute $A_i$
of C, is called the attribute value domain of attribute $A_i$ of C. In a CBR system
for domain DOM, the cases of C need to be stored in the CBR system for solving the new
arrival case. A, the set of significant attributes, acts like the index to books in a
library and helps the CBR system select cases likely to fulfil the needs of the
problem posed by the new arrival case. When retrieving cases, a matching function is
used to retrieve cases based on a weighted sum of matched attributes in the input
cases. Attributes can be viewed as indexing features of a design case or as the
decision variables relevant to the original design situation. An index of a case can
be formally defined as follows.

DEFINITION 1 (Index of Case in CBR system for Domain DOM): The index $IND_k$ of a case
$C_k$ in the CBR system for domain DOM is defined as
$IND_k = \{A_1 = V_1(k), A_2 = V_2(k), \ldots, A_d = V_d(k)\}$.

Example 3.1: As shown in Fig. 2(a), there are five cases, each of which has three
attributes: OS, PL and DB. Therefore, d is equal to 3. The attribute value domains of
OS, PL and DB are $V_1$=<WinNT, OS2, Linux, Mac, Solaris>, $V_2$=<C, Basic, Java,
Pascal> and $V_3$=<SQL-Server, ORACLE, SYBASE>. The index of Case 1 is written as
{OS=WinNT, PL=C, DB=SQL-Server}.

In CBR, a case represents an experienced situation. This experienced situation
constitutes knowledge which can be used in the future. When a similar situation
arises, the knowledge that went into the case provides a starting point for
interpreting the new situation or solving the problem it poses.

Formally, let Cv be the set of case contents, each including the case description
representing the problem to be solved at that time, the stated or derived solution to
the problem specified in the problem description, and the resulting state when the
solution was carried out. A case in CBR can be formally defined as follows.
DEFINITION 2 (Case in CBR system for Domain DOM): The case $C_k$ in the CBR system for
domain DOM is a pair {$IND_k$, $cv_k$}, where $cv_k \in Cv$ and $C_k \in C$.

Example 3.2: We assume that a software company wants to design software that allows
multiple users to access data stored in an SQL database. Case 1 of Fig. 2(a) can be
expressed as follows:

Case 1 {Index $IND_1$: Operating System=WinNT, Programming language=C language and
Database=SQL-Server. Contents of the case $cv_1$: It adopts WinNT as the OS,
C language as the development tool and SQL-Server as the database. Result: good
performance.}



(a) An example case base    (b) The bitmap indexing for (a)

Fig. 2. An example case base and corresponding bitmap indexes



A case of Case-based reasoning using the proposed indexing method, called bit-wise 
indexing method, is formally defined as follows: 

DEFINITION 3 (the i-th attribute bit-wise indexing vector of the case): The bit-wise
indexing vector $B_i$ of the i-th attribute for the case $C_k$ in C is a bit string
$B_i = b_{i1} b_{i2} \ldots b_{i\alpha(i)}$, where $b_{ij} = 1$ if
$V_i(k) = V_{ij}$ and $b_{ij} = 0$ otherwise, for $0 < i \le d$.

DEFINITION 4 (bit-wise indexing vector of case): A bit-wise indexing vector $BWI_k$
of case $C_k$ is a concatenation of the attribute bit-wise indexing vectors. That is,
$BWI_k = B_1 B_2 \ldots B_d$ for d attributes.

Example 3.3: According to DEFINITIONS 3 and 4, the bit-wise indexing vectors of
attributes OS, PL and DB for Case 1 in Fig. 2(a) are $B_1$="10000", $B_2$="1000" and
$B_3$="100", respectively. The bit-wise indexing vector $BWI_1$ of Case 1 is
$B_1 B_2 B_3$ = "100001000100".

DEFINITION 5 (Matrix of bit-wise indexes for case-based reasoning): A matrix of
bit-wise indexing for CBR is written as

$$T_{BWI} = \begin{bmatrix} BWI_1 \\ BWI_2 \\ \vdots \\ BWI_{|C|} \end{bmatrix}$$

Example 3.4: According to DEFINITION 5, the $T_{BWI}$ of the cases of Fig. 2(a) is
shown below:

$$T_{BWI} = \begin{bmatrix} BWI_1 \\ BWI_2 \\ BWI_3 \\ BWI_4 \\ BWI_5 \end{bmatrix}
= \begin{bmatrix}
10000 & 1000 & 100 \\
01000 & 0100 & 010 \\
00100 & 0010 & 001 \\
00010 & 0010 & 010 \\
00001 & 0001 & 100
\end{bmatrix}$$
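
Definitions 3 to 5 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the names `DOMAINS`, `attribute_bits` and `bwi` are made up for the example, and the five cases are read back from the rows of the matrix in Example 3.4 using the value domains of Example 3.1.

```python
# Attribute value domains from Example 3.1; their order fixes the bit positions.
DOMAINS = {
    "OS": ["WinNT", "OS2", "Linux", "Mac", "Solaris"],
    "PL": ["C", "Basic", "Java", "Pascal"],
    "DB": ["SQL-Server", "ORACLE", "SYBASE"],
}


def attribute_bits(attribute, value):
    """Definition 3: one bit per domain value, set only at the case's value."""
    return "".join("1" if v == value else "0" for v in DOMAINS[attribute])


def bwi(case):
    """Definition 4: concatenate the per-attribute bit strings."""
    return "".join(attribute_bits(a, case[a]) for a in DOMAINS)


# Case base reconstructed from the matrix rows of Example 3.4 (an assumption,
# since Fig. 2(a) itself is not reproduced in the text).
CASES = [
    {"OS": "WinNT", "PL": "C", "DB": "SQL-Server"},
    {"OS": "OS2", "PL": "Basic", "DB": "ORACLE"},
    {"OS": "Linux", "PL": "Java", "DB": "SYBASE"},
    {"OS": "Mac", "PL": "Java", "DB": "ORACLE"},
    {"OS": "Solaris", "PL": "Pascal", "DB": "SQL-Server"},
]

# Definition 5: the matrix has one bit-wise indexing vector per case.
MATRIX = [bwi(case) for case in CASES]
```

Each row has length 5 + 4 + 3 = 12, the sum of the sizes of the attribute value domains.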



4 The Phases of Bit-Wise Indexing CBR 

4.1 Indexing Phase in Bit-Wise Indexing CBR 

Unlike the bitmap indexing method, which takes each attribute value as a bit vector to
construct indexes, the bit-wise indexing method views each case as a bit vector. For
the cases shown in Fig. 2(a), Fig. 2(b) shows the bitmap indexes. When a case
(OS=WinNT; PL=Java; DB=ORACLE) arrives, some partially matched cases may be obtained
by checking the ($B_{WinNT}$ AND $B_{Java}$), ($B_{WinNT}$ AND $B_{ORACLE}$),
($B_{Java}$ AND $B_{ORACLE}$) and ($B_{WinNT}$ AND $B_{Java}$ AND $B_{ORACLE}$)
vectors in the bitmap indexing method. To compute the similarity of the i-th case and
the new case, we need to check the i-th bit position in each of these vectors. Once
the number of attributes or the number of cases is large, the computing time of
scanning to the i-th position in the vectors increases dramatically. Therefore, we
propose a new indexing method that is suitable for similarity computing. Let the
length of the bit-wise indexes be $l = \sum_{i=1}^{d} \alpha(i)$. The bit-wise indexes
creation algorithm and the matrix of bit-wise indexes creation algorithm are shown as
follows:

Algorithm 4.1 (Bit-wise indexes creation algorithm):

Input: Case $C_k$ of C.

Output: The $BWI_k$ of $C_k$.

Step 1: Create bit-wise vector $BWI_k$ of case $C_k$ with length l.

Step 2: For each bit $b_{ij}$ of $BWI_k$, if $V_i(k) = V_{ij}$, $b_{ij} = 1$;
otherwise, $b_{ij} = 0$.

Step 3: Return $BWI_k$.

Algorithm 4.2 (Matrix of bit-wise indexes creation algorithm):

Input: C of CBR.

Output: The $T_{BWI}$ of the CBR.

Step 1: Create an empty matrix $T_{BWI}$ and set counter i to 1.

Step 2: For each case $C_i$ in CBR, do the following sub-steps.

Step 2.1: Use the bit-wise indexes creation algorithm to get the $BWI_i$ of $C_i$.

Step 2.2: Add the $BWI_i$ into $T_{BWI}$.

Step 2.3: If $i = |C|$, exit the sub-procedure; otherwise, set $i = i + 1$ and go to
Step 2.1.

Step 3: Return $T_{BWI}$.

After the bit-wise indexing matrix is built, the bit-wise operations can be easily ap- 
plied on it. According to the characteristics of bit-wise indexing method, we will dis- 
cuss our similarity measurement algorithms in following sections. 



4.2 Matching Phase in Bit-Wise Indexing CBR 

With the bit-wise indexing matrix in hand, bit-wise operations can be used to retrieve
similar cases in the CBR system easily. However, computing the similarity of all cases
using the matching function is a time-consuming task. A two-phase Similar Cases
Seeking algorithm, consisting of a relevant cases retrieving phase and a similarity
computing phase, is therefore proposed. Since bit-wise operations are quite fast, the
major concern of our algorithm is to reduce the computing time by filtering out all
irrelevant cases before calculating the similarities. If all irrelevant cases are
filtered out first, the time for retrieving useful prior cases decreases greatly.
Therefore, in the relevant cases retrieving phase all irrelevant cases are filtered
out quickly, and the similarities of the remaining cases are then computed efficiently
in the similarity computing phase.
Algorithm 4.3 (Similar Cases Seeking Algorithm):

Input: The $T_{BWI}$ and a new case $C_n$.

Output: A set of similar cases Rc and a set of the corresponding similarity degrees Rs.

Step 1: Use the bit-wise indexes creation algorithm to get $BWI_n$ of case $C_n$ with
length l.

Step 2: Initialize the counter j to 1 and let Rc and Rs be empty.

Step 3: For each $BWI_j$ in $T_{BWI}$, do the following sub-steps (where
$1 \le j \le |C|$):

Step 3.1: Call the Search-relevant Algorithm to compute the relevant degree $rdi_j$
between $BWI_j$ and $BWI_n$.

Step 3.2: If $rdi_j = 0$, $rdi_j$ is dropped; go to Step 3.4.

Step 3.3: Call the Similarity Computing Algorithm to compute $sim_j$, and then add
$sim_j$ and $case_j$ into Rs and Rc.

Step 3.4: Add 1 to j.

Step 4: Sort Rc in descending order according to the corresponding similarity degree
in Rs.

Step 5: Output Rc and Rs.

The Retrieving Relevant Prior Cases of Bit-Wise Indexing CBR

For one prior case, if it is relevant to the new case, at least one of its attributes
has the same value as that of the same attribute of the new case. That is, the bits in
the corresponding positions of the same attributes are set to "1" in both bit vectors,
which can be detected by using the AND bit-wise operation to compare the two bit
vectors. A set bit in the result means that the two cases have the same attribute
value for the corresponding attribute; in other words, the two cases are similar to
some degree. The Search-relevant algorithm is described in the following:

Algorithm 4.4 (Search-relevant algorithm):

Input: The bit-wise indexing vector $BWI_n$ of a new arrival case $C_n$, and $BWI_j$
of case $C_j$ in C.

Output: The relevant degree $rdi_j$.

Step 1: Use the AND bit-wise operation on $BWI_n$ and $BWI_j$, and then store the
result into $rdi_j$, which is also a bit-wise indexing vector with length l.

Step 2: Return $rdi_j$.

Since the AND bit-wise operation takes the execution time of one instruction, the
Search-relevant algorithm can select the relevant prior cases quickly. In the real
implementation, integers are used to represent the prior cases’ indexes, the new
arrival case’s indexes and $rdi_j$. If $rdi_j$ is zero, the bit vector of the prior
case is filtered out. In this way, all irrelevant prior cases can be filtered out
efficiently and precisely.
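
With the vectors held as integers, the Search-relevant step is a single AND. The sketch below illustrates this on the case base of Example 3.4, using a new case (OS=Solaris, PL=Java, DB=ORACLE); the variable names are illustrative only.

```python
def search_relevant(bwi_new, bwi_prior):
    """Algorithm 4.4 on integer-encoded vectors: one bit-wise AND."""
    return bwi_new & bwi_prior


NEW_CASE = int("000010010010", 2)  # Solaris, Java, ORACLE
PRIOR = {
    1: int("100001000100", 2),
    2: int("010000100010", 2),
    3: int("001000010001", 2),
    4: int("000100010010", 2),
    5: int("000010001100", 2),
}

# Keep only the prior cases whose relevant degree is non-zero.
RELEVANT = {}
for j, vector in PRIOR.items():
    rdi = search_relevant(NEW_CASE, vector)
    if rdi != 0:
        RELEVANT[j] = rdi
```

Case 1 shares no attribute value with the new case, so its AND result is zero and it is dropped before any similarity is computed.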

The Similarity Computing of Bit-Wise Indexing CBR 

After all relevant prior cases have been retrieved, the similarities between these
relevant prior cases and the new case need to be computed in order to select the
useful prior cases. As discussed above, we use the matching function to retrieve
cases. The matching function is based on a weighted sum of the attributes of the input
case that match the prior cases in the case base. Each attribute has its own weight.
The similarities between relevant prior cases and the new case cannot be computed
without knowing which attributes have the same values. Since each attribute of every
case has only one attribute value, at most one bit per attribute is set in rdi after
executing the Search-relevant algorithm. Accordingly, we propose a special bit-wise
vector, called the Mask Vector, to solve the bottleneck of the similarity computing
phase. Denote by $\langle 1 \rangle_i = b_{i1} b_{i2} \ldots b_{i\alpha(i)}$ with
$\forall b_{ij} = 1$ the 1-vector of length $\alpha(i)$, and by
$\langle 0 \rangle_i = b_{i1} b_{i2} \ldots b_{i\alpha(i)}$ with $\forall b_{ij} = 0$
the 0-vector of length $\alpha(i)$. The definition of the Mask Vector is given below:

DEFINITION 6 (Mask Vector): A mask bit-wise indexing vector Mask is a set of
$Mask_k$, where $0 < k \le d$. $Mask_k$, the mask vector of attribute $A_k$, is a
concatenation of bit strings $S_i$, where $Mask_k = S_1 S_2 \ldots S_d$ with
$S_k = \langle 1 \rangle_k$ and $S_i = \langle 0 \rangle_i$ for all $i \ne k$.

Example 4.1: Continuing Example 3.4 in Section 3.2, the Mask vectors $Mask_1$,
$Mask_2$ and $Mask_3$ of attributes OS, PL and DB are {11111 0000 000},
{00000 1111 000} and {00000 0000 111}, respectively.

Applying the AND operation to a Mask vector and the bit-wise vector produced by the
search-relevant phase ($rdi_i$) is called Mask-Vector processing. Based on its
results, the similarity of each attribute between the new arrival case and the prior
cases can be computed easily.

As discussed above, we propose a suitable similarity-measuring function for BWI-CBR to
compute the similarity according to the result of Mask-Vector processing. The function
is:

$$SIM(Case_i) = \frac{\sum_{j=1}^{d} (PC_{ij} \times W_j)}{\sum_{j=1}^{d} W_j} \qquad (1)$$

where $SIM(Case_i)$ is the similarity between the i-th prior case and the new case, and
$W_j$ is the weight of attribute j. If the result of performing the AND bit-wise
operation on the relevant degree $rdi_i$ and $Mask_j$ is 0, then $PC_{ij}$ is set to
0; otherwise $PC_{ij}$ is set to 1.

Example 4.2: Assume that the weights of attributes OS, PL and DB, denoted as $W_1$,
$W_2$ and $W_3$, are each set to 1. Also, assume that the results of Mask-Vector
processing for Cases 1, 2, 3 and 4 are "001", "001", "011", and "100", respectively.
Therefore, the similarities of these cases are 0.333, 0.333, 0.667 and 0.333,
respectively.
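
Mask-Vector processing and the weighted similarity can be sketched as follows. The functions `make_masks` and `similarity` are hypothetical names introduced for illustration; the attribute widths 5, 4 and 3 come from Example 3.1 and the unit weights from Example 4.2.

```python
WIDTHS = [5, 4, 3]         # number of values of OS, PL and DB (Example 3.1)
WEIGHTS = [1.0, 1.0, 1.0]  # Example 4.2 sets every weight to 1


def make_masks(widths):
    """Definition 6: Mask_k has ones over attribute k's bits and zeros elsewhere."""
    masks, total, start = [], sum(widths), 0
    for width in widths:
        shift = total - start - width  # distance of the segment from the low end
        masks.append(((1 << width) - 1) << shift)
        start += width
    return masks


def similarity(rdi, masks, weights):
    """Weighted share of attributes whose mask AND rdi is non-zero."""
    matched = [1 if rdi & mask else 0 for mask in masks]
    return sum(pc * w for pc, w in zip(matched, weights)) / sum(weights)


MASKS = make_masks(WIDTHS)
```

For a relevant degree of 00000 0010 010, the PL and DB masks both give a non-zero AND, so two of three equally weighted attributes match and the similarity is 2/3.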

Since the computing results (similarity degrees) of Case 1 and Case 2 are the same,
the value needs to be computed only once: all possible similarities are pre-computed
and stored in the Similarity Mapping List. The computing overhead and the processing
time can thus be greatly reduced.

DEFINITION 7 (Similarity Mapping List): Let L be the Similarity Mapping List. $L_i$ is
the element in L with the index value i, where $b_{i1} b_{i2} \ldots b_{id}$ is the
binary representation of i and $1 \le i \le 2^d - 1$.

$$L_i = \sum_{j=1}^{d} b_{ij} \times W_j \qquad (2)$$

Algorithm 4.5 (Similarity Mapping List Creation Algorithm):

Input: Weights for the indexes of CBR.

Output: The similarity mapping list L.

Step 1: Initialize the counter k to 1 and let List L be empty.

Step 2: For each k, do the following sub-steps.

Step 2.1: Encode k into a binary string
$k = \langle b_{k1}, b_{k2}, \ldots, b_{kd} \rangle$.

Step 2.2: Calculate the similarity degree $L_k$ by Formula 1 in Definition 8.

Step 2.3: Add $L_k$ into L.

Step 2.4: If $k = 2^d - 1$, then exit the processing of sub-steps; otherwise, let
$k = k + 1$ and repeat the sub-steps of Step 2.

Step 3: Return L.

After the Similarity Mapping List has been built, the similarity between each prior
case and the new arrival case can be found quickly by the following algorithm:

Algorithm 4.6 (Similarity Computing Algorithm):

Input: The relevant degree $rdi_j$, the Mask Vector and the Similarity Mapping List L.

Output: The similarity of $case_j$.

Step 1: Initialize k to be a binary string with length d.

Step 2: For each i, set the i-th position of k to 1 if the result of using the AND
bit-wise operation on $Mask_i$ and $rdi_j$ is not all 0; otherwise, set it to 0.

Step 3: Transform k into an integer and set $sim_j$ to the element of L at that index.

Step 4: Return $sim_j$.
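
The two algorithms above can be sketched together: a table indexed by the d-bit match pattern, and a lookup that builds the pattern from the mask tests. This is a sketch under assumptions: following Step 2.2 the entries are computed with the normalised weighted sum of Formula 1, entry 0 (no attribute matched) is added for convenience, and all names are illustrative.

```python
MASKS = [0b111110000000, 0b000001111000, 0b000000000111]  # OS, PL, DB


def build_mapping_list(weights):
    """Algorithm 4.5: table[k] for every match pattern k (first attribute = MSB)."""
    d, total = len(weights), sum(weights)
    table = [0.0] * (1 << d)  # index 0 is an added convenience entry
    for k in range(1, 1 << d):
        bits = format(k, f"0{d}b")
        table[k] = sum(w for b, w in zip(bits, weights) if b == "1") / total
    return table


def lookup_similarity(rdi, masks, table):
    """Algorithm 4.6: derive pattern k from the mask tests, then read table[k]."""
    k = 0
    for mask in masks:
        k = (k << 1) | (1 if rdi & mask else 0)
    return table[k]


TABLE = build_mapping_list([1, 1, 1])
```

The table has only 2^d entries, so for a small number of attributes each similarity becomes d AND operations and one array access, regardless of how many cases share the same match pattern.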

During the pre-processing step, the Similarity Mapping List and the Mask Vector are
constructed. In the Similarity Computing Algorithm, only the AND bit-wise operation
needs to be performed on the Mask Vector and the bit-wise vectors of the relevant
cases. Therefore, we can use the Similarity Mapping List to find the similarities
between relevant prior cases and the new case quickly and easily. Since redundant
similarity computation is avoided, the overhead and time of computing the similarity
for each relevant prior case are greatly reduced.

Example 4.3: In this Example, we will show our Similar Cases Seeking Algorithm in 
detail. Following the Example 3.4, for the new arrival case C^: {OS=Solaris, PL=Java, 
DB=ORCALE}, we have = {00001 0010 010}. In Step 2 of Similar Cases 

Seeking Algorithm, Initial state:(/=l) and set Rc and Rs to empty. For each BWI. in 
do the following sub-steps (where l<y<ICI): 

• For BWI_1: The Search-relevant Algorithm is used to compute the relevant degree
rd_1 between BWI_1 and BWI_n, and the return value is rd_1 = {00000 0000 000}.
Because the bits of rd_1 are all "0", Case 1 is filtered out.

• For BWI_2: Similarly, we have rd_2 = {00000 0000 010}. Since one of the bits in rd_2
is equal to "1", the Similarity Computing Algorithm is called to compute sim_2 = 0.333,
and then sim_2 and case_2 are added into Rs and Rc, respectively
(sim_2 = L_k = L_001 = L_1 = 0.333).

• For BWI_3: Similarly, we have rd_3 = {00000 0010 000}. Since one of the bits in rd_3
is equal to "1", sim_3 = 0.333, and then sim_3 and case_3 are added into Rs and Rc.

• For BWI_4: Similarly, we have rd_4 = {00000 0010 010}. Since one of the bits in rd_4
is equal to "1", sim_4 = 0.667, and then sim_4 and case_4 are added into Rs and Rc.

• For BWI_5: Similarly, we have rd_5 = {00001 0000 000}. Since one of the bits in rd_5
is equal to "1", sim_5 = 0.333, and then sim_5 and case_5 are added into Rs and Rc.

After sorting the element pairs of Rc and Rs in decreasing order of similarity, the new
Rs and new Rc become:



Rc:   Case 4   Case 2   Case 3   Case 5
Rs:   0.667    0.333    0.333    0.333
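The filtering and ranking walked through in Example 4.3 can be reproduced with a short Python sketch. The bit-wise indexes of the five prior cases are hypothetical (the case base of Example 3.4 is not repeated here); they were picked only so that their AND with BWI_n gives the relevant degrees listed above. For brevity the sketch counts matching dimensions directly instead of going through the Similarity Mapping List, and it assumes equal weights for the three dimensions.

```python
WIDTHS = (5, 4, 3)  # bits per dimension (OS, PL, DB), read off the example


def to_int(bits):
    """Parse a spaced bit string such as '00001 0010 010' into an integer."""
    return int(bits.replace(" ", ""), 2)


def dimension_masks(widths):
    """One all-ones mask per dimension, leftmost dimension first."""
    masks, shift = [], sum(widths)
    for width in widths:
        shift -= width
        masks.append(((1 << width) - 1) << shift)
    return masks


def seek_similar(new_bwi, case_bwis, widths=WIDTHS):
    """Filter out cases whose AND with the new case is all zero, then rank."""
    masks = dimension_masks(widths)
    found = []
    for case_id, bwi in case_bwis.items():
        rd = bwi & new_bwi              # relevant degree
        if rd == 0:
            continue                    # irrelevant case, filtered out
        matched = sum(1 for mask in masks if rd & mask)
        found.append((case_id, round(matched / len(masks), 3)))
    found.sort(key=lambda pair: pair[1], reverse=True)  # stable sort
    return found


# Hypothetical prior cases, consistent with the rd values in Example 4.3.
CASES = {
    1: to_int("10000 1000 100"),
    2: to_int("10000 1000 010"),
    3: to_int("10000 0010 100"),
    4: to_int("10000 0010 010"),
    5: to_int("00001 1000 100"),
}
NEW_CASE = to_int("00001 0010 010")

print(seek_similar(NEW_CASE, CASES))
# [(4, 0.667), (2, 0.333), (3, 0.333), (5, 0.333)]
```

Case 1 never reaches the similarity computation, which is the point of the relevance filter.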



5 Experiments and Discussions 

To evaluate the performance of BWI-CBR, we compare it with two other indexing
methods: SQL-CBR, which uses the index of the relational database as the index
method of the CBR, and Bitmap-CBR, which uses the bitmap indexing method to
construct the indexes of cases. Our target machine is a dual-processor Pentium-166
system running the multithreaded Microsoft Windows NT operating system. The system
includes 512K of L2 cache and 128MB of shared memory.

Compare BWI-CBR with SQL-CBR 

In this comparison, SQL-CBR uses Microsoft SQL Server as the case base. The
result of comparing BWI-CBR and SQL-CBR is shown in Fig. 3.

According to the result, BWI-CBR is much more efficient than SQL-CBR. The reasons are:

• In the relevant case retrieval phase, BWI-CBR transforms the indexes into bit-wise
indexes and uses bit operations to retrieve relevant prior cases. Because bit operations
are fast, the retrieval algorithm can filter out irrelevant cases efficiently. In SQL-CBR,
the target SQL statement needs to be transformed into several SQL statements with
different WHERE clauses to handle partial matching when retrieving relevant cases.
Therefore, BWI-CBR is much faster than SQL-CBR in this phase.

• In the similarity computing phase, BWI-CBR uses bit operations to compare the new
case with each prior case in the case base, and uses the Mask Vectors to check the
similarity between them. That is, it does not need to check the contents one by one to
obtain the similarities. SQL-CBR, however, must check the details of each feature's
contents to compute the similarity. Therefore, this task is also much faster in BWI-CBR
than in SQL-CBR.

Fig. 3. BWI-CBR vs SQL-CBR



Compare BWI-CBR with Bitmap-CBR

The result of comparing BWI-CBR with Bitmap-CBR is shown in Fig. 4. We
can see that BWI-CBR is faster than Bitmap-CBR. The reasons are:

• In the relevant case retrieval phase, the bitmap indexing technology is not well suited
to retrieving similar cases. It needs to check more vectors than BWI-CBR, so it needs
more time than BWI-CBR.

• In the similarity computing phase, Bitmap-CBR needs to check the i-th bit position
in the results of AND operations over the bitmap vectors of the new case's feature
values (here Solaris, Java, and ORACLE) for case c_i. Scanning to the i-th position
in the bitmap vectors needs some extra time, especially when the number of attributes
of the cases or the number of cases in the case base is large. This wasted time can
become considerable. Therefore, similarity computing in BWI-CBR is much faster than
in Bitmap-CBR.

Fig. 4. BWI-CBR vs Bitmap-CBR




W.-C. Chen et al.



6 Conclusion and Future Work 

In addition to accuracy, performance should also be taken into consideration when
retrieving similar cases in CBR, especially when the number of cases in the CBR system
grows large. In this paper, the performance issue of large-scale CBR that uses a
data warehouse as its storage space is discussed, and a new indexing method, called the
bit-wise indexing method, has been proposed. The corresponding algorithms,
including index creation and case retrieval algorithms, are also proposed. Finally, some
experiments compare the performance with two other methods, the traditional indexing
method and the bitmap indexing method of data warehousing, and the results show that
the proposed method is faster than both. In the future, we will attempt to apply the
indexing method and the corresponding retrieval algorithm to CBR with a
multi-processor data warehousing system.

Abstract. Many data sets show significant correlations between input
variables, and much useful information is hidden in the data in a nonlinear
format. It has been shown that a neural network is better than a
direct application of induction trees in modeling nonlinear characteristics
of sample data. We have extracted a compact set of rules to support data
with input variable relations over continuous-valued attributes. Those
relations as a set of linear classifiers can be obtained from neural network
modeling based on back-propagation. It is shown in this paper that variable
thresholds play an important role in constructing linear classifier
rules when we use a decision tree over linear classifiers extracted from a
multilayer perceptron. We have tested this scheme over several data sets
to compare it with the decision tree results.



1 Introduction 

The discovery of decision rules and recognition of patterns from data examples
is one of the most challenging problems in machine learning. If data points
contain numerical attributes, induction tree methods need the continuous-valued
attributes to be made discrete with threshold values. Induction tree algorithms
such as C4.5 build decision trees by recursively partitioning the input attribute
space. The tree traversal from the root node to each leaf leads to one
conjunctive rule. Each internal node in the decision tree has a splitting criterion or
threshold for continuous-valued attributes to partition some part of the input
space, and each leaf represents a class related to the conditions of each internal
node.

Approaches based on decision trees involve making the continuous-valued
attributes discrete in input space, creating many rectangular divisions. As a
result, they may be unable to detect data trends or desirable classification
surfaces. Even in the case of multivariate discretization methods, which search in
parallel for threshold values for more than one continuous attribute, the
decision rules may not reflect data trends, or the decision tree may build many
rules supported by only a small number of examples, or ignore some data points
by dismissing them as noisy.

A possible process has been suggested to grasp the trend of the data. It first
fits the relationship between data points in a given data set, using a
statistical technique. It then generates many data points on the response surface of
the fitted curve and induces rules with a decision tree. This method was
introduced as an alternative to the direct application
of the induction tree to raw data. However, it still has the problem of
requiring many induction rules to reflect the response surface.

In this paper we use a hybrid technique to combine neural networks and
decision trees for data classification. It has been shown that neural networks
are better than direct application of induction trees in modeling nonlinear
characteristics of sample data. Neural networks have the advantage of
being able to deal with noisy, inconsistent and incomplete data. A method to
extract symbolic rules from neural networks has been proposed to increase the
performance of the decision process. The KT algorithm developed
by Fu extracts rules from subsets of connected weights with high activation in
a trained network. The M of N algorithm clusters weights of the trained network
and removes insignificant clusters with low active weights. Then the rules are
extracted from the weights.

A simple rule extraction algorithm that uses discrete activations over
continuous hidden units is presented by Setiono and Taha. They used in
sequence a weight-decay back-propagation over a three-layer feed-forward
network, a pruning process to remove irrelevant connection weights, a clustering of
hidden unit activations, and extraction of rules from discrete unit activations.
They derived symbolic rules from neural networks that include oblique decision
hyperplanes instead of general input attribute relations. Also, the direct
conversion from neural networks to rules has an exponential complexity when using
a search-based algorithm over incoming weights for each unit. Most of the
rule extraction algorithms derive rules from neuron weights and neuron
activations in the hidden layer with a search-based method. An instance-based
rule extraction method has been suggested to reduce computation time by avoiding
search-based methods. After training neural networks with two hidden layers, the
first hidden layer weight parameters are treated as linear classifiers. These linear
differentiated functions are chosen by decision tree methods to determine
decision boundaries after re-organizing the training set in terms of the new linear
classifier attributes.

Our approach is to train a neural network with sigmoid functions and to
use decision classifiers based on weight parameters of neural networks. Then an
induction tree selects the desirable input variable relations for data classification.
Decision tree applications have the ability to determine proper subintervals over
continuous attributes by a discretization process. This discretization process will cover
oblique hyperplanes mentioned in Setiono's papers. In this paper, we have tested
linear classifiers with variable thresholds and fixed thresholds. The methods are
tested on various types of data and compared with the method based on the
decision tree alone.



2 Problem Statement

Induction trees are useful for a large number of examples, and they enable us to
obtain proper rules from examples rapidly. However, they have difficulty
in inferring relations between data points and cannot handle noisy data.






Fig. 1. Example (a) data set and decision boundary (O : class 1, X : class 0) (b)-(c) 
neural network fitting (d) data set with 900 points 



We can see a simple example of undesirable rule extraction in
the induction tree application. Fig. 1(a) displays a set of 29 original sample data
points with two classes. It appears that the set has four sections whose
boundaries run from upper-left to lower-right. The set of dotted boundary
lines is the result of multivariate classification by the induction tree. It has six
rules to classify data points. Even a C4.5 run yields four rules with 6.9%
error, making divisions on attribute y. The rules do not capture the data
clustering completely in this example. Fig. 1(b)-(c) shows neural network fitting with
the back-propagation method. In Fig. 1(b)-(c) neural network nodes have slopes
alpha = 1.5 and 4.0 for sigmoids, respectively. After curve fitting, 900 points were
generated uniformly on the response surface for the mapping from input space
to class, and the response values of the neural network were calculated as shown
in Fig. 1(d). The result of applying C4.5 to those 900 points followed the classification
curves, but produced 55 rules. The production of many rules results from the
fact that the decision tree makes piecewise rectangular divisions for each rule. This
happens in spite of the fact that the response surface for data clustering has a
correlation between the input variables.

As shown above, the decision tree has a problem of over-generalization for
a small number of data points and a problem of over-specialization for a large number
of data points. A possible suggestion is to consider or derive relations between input
variables as another attribute for rule extraction. However, it is difficult to find
input variable relations for classification directly in supervised learning, while
unsupervised methods can use statistical methods such as principal component
analysis [2].



3 Method 

The goal of our approach is to generate rules following the shape and
characteristics of response surfaces. Usually induction trees cannot trace the trend of
data, and they determine data clustering only in terms of input variables, unless
we apply other relation factors or attributes. In order to improve classification
rules from a large training data set, we allow input variable relations over multiple
attributes in a set of rules. We use a two-phase method for rule extraction over
continuous-valued attributes.

Given a large training set of data points, the first phase, as a feature extrac- 
tion phase, is to train feed-forward neural networks with back-propagation and 
collect the weight set over input variables in the first hidden layer. A feature 
useful in inferring multi-attribute relations of data is found in the first hidden 
layer of neural networks. The extracted rules involving network weight values 
will reflect features of data examples and provide good classification boundaries. 
Also they may be more compact and comprehensible, compared to induction 
tree rules. 

In the second phase, as a feature combination phase, the extracted features
for linear classification boundaries are combined using Boolean logic
gates. In this paper, we use an induction tree to combine the linear classifiers.

The highly nonlinear property of neural networks makes it difficult to describe 
how they reach predictions. Although their predictive accuracy is satisfactory 
for many applications, they have long been considered as a complex model in 
terms of analysis. By using expert rules derived from neural networks, the neural 
network representation can be more understandable. 

It has been shown that a particular set of functions can be obtained with
arbitrary accuracy by at most two hidden layers given enough nodes per layer.
Also, one hidden layer is sufficient to represent any Boolean function.
Our neural network structure has two hidden layers, where the first hidden layer
makes a local feature selection with linear classifiers and the second layer receives
Boolean logic values from the first layer and maps any Boolean function. The
second hidden layer and output layer can be thought of as a sum of the product
of Boolean logic gates. The n-th output of the neural network for a set of data is
$F_n = f(\sum_{k} W^{2}_{kn} f(\sum_{j} W^{1}_{jk} H_j))$, where $H_j$ is the activation
of the j-th node in the first hidden layer and the superscript on W indexes the weight
layer. After training data patterns with
a neural network by back-propagation, we can have linear classifiers in the first
hidden layer.

For a node in the first hidden layer, the activation is defined as
$H_j = f(\sum_{i=1}^{N_0} a_i W_{ij})$ for the j-th node, where $N_0$ is the number of
input attributes, $a_i$ is an input, and $f(x) = 1.0/(1.0 + e^{-\alpha x})$ is a sigmoid
function. When we train neural networks with the back-propagation method, $\alpha$,
the slope of the sigmoid function, is increased as iteration continues. If we have a
high value of $\alpha$, the activation of each neuron is close to that of a digital logic
gate, which has a binary value of 0 or 1.

Except for the first hidden layer, we can replace each neuron by logic gates if
we assume we have a high slope for the sigmoid function. Input to each neuron
in the first hidden layer is represented as a linear combination of input attributes
and weights, $\sum_{i} a_i W_{ik}$. This forms linear classifiers for data classification as a
feature extraction over data distribution. When the data of Fig. 1(a) is trained, we can
introduce new attributes aX + bY, where a, b are constants. We use two hidden
layers with 4 nodes and 3 nodes, respectively, where every neuron node has a
high sigmoid slope to guarantee desirable linear classifiers as shown in Fig. 1(c).
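The effect of the sigmoid slope on a first-hidden-layer node can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the explicit bias term is an assumption (the text writes the net input as a plain weighted sum), and the weights 1.44, 1.73 and offset 5.98 are borrowed from the first extracted rule purely as sample numbers.

```python
import math


def sigmoid(x, alpha=1.0):
    """f(x) = 1 / (1 + exp(-alpha * x)), written to avoid overflow."""
    z = alpha * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def first_layer_activation(inputs, weights, bias, alpha):
    """Activation of one first-hidden-layer node: f(sum_i a_i * W_ik + bias)."""
    net = sum(a * w for a, w in zip(inputs, weights)) + bias
    return sigmoid(net, alpha)


point = (3.0, 2.0)                 # sample input (x, y)
weights = (1.44, 1.73)             # sample weights
bias = -5.98                       # assumed bias, so net = 1.44x + 1.73y - 5.98

soft = first_layer_activation(point, weights, bias, alpha=1.5)
hard = first_layer_activation(point, weights, bias, alpha=40.0)
print(soft, hard)  # the high-slope node is nearly a 0/1 logic value
```

With a low slope the node reports a graded value near the boundary, while a high slope makes the same node behave like the test "1.44x + 1.73y > 5.98".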

We transformed the 900 data points in Fig. 1(d) into four linear classifier data
points, and then we added the classifier attributes to the original attributes x, y.
The induction tree algorithm used these six attributes as its input attributes. We
then obtained only four rules with C4.5, while a simple application of C4.5 to
the same data generated 55 rules. The rules are given as follows:

rule 1 : if (1.44x + 1.73y <= 5.98), then class 0
rule 2 : if (1.44x + 1.73y > 5.98)
         and (1.18x + 2.81y <= 12.37), then class 1
rule 3 : if (1.44x + 1.73y > 5.98)
         and (1.18x + 2.81y > 12.37)
         and (0.53x + 2.94y <= 14.11), then class 0
rule 4 : if (1.44x + 1.73y > 5.98)
         and (1.18x + 2.81y > 12.37)
         and (0.53x + 2.94y > 14.11), then class 1
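The four rules above form a cascade over three oblique boundaries, which can be written directly as a small Python function. The function name and the sample points are hypothetical; the coefficients are the ones listed in the rules, and the third boundary is treated as "less than or equal" to match rules 1 and 2, which is an assumption about the printed rule.

```python
def classify(x, y):
    """Apply rules 1-4 in order; each test is one linear classifier."""
    if 1.44 * x + 1.73 * y <= 5.98:
        return 0  # rule 1
    if 1.18 * x + 2.81 * y <= 12.37:
        return 1  # rule 2
    if 0.53 * x + 2.94 * y <= 14.11:
        return 0  # rule 3
    return 1      # rule 4


for point in [(0.0, 0.0), (3.0, 2.0), (2.0, 4.0), (2.0, 5.0)]:
    print(point, classify(*point))
```

Each band between two neighbouring boundaries gets a single rule, which is why four rules suffice where axis-parallel splits needed 55.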

These linear classifiers exactly match the boundaries shown in Fig. 1(c),
and they are more dominant for classification in terms of entropy minimization
than the original input attributes themselves. Even if we include the input attributes,
the entropy measurement leads to a rule set with boundary equations. These
rules are more meaningful than those of direct C4.5 application to raw data
since their divisions show the trend of data clustering and how each attribute is
correlated.

Our approach can be applied to a data set that has both discrete and
continuous values. Let the set of input attributes be
$Y = \{D_1, ..., D_m, C_1, ..., C_n\}$, where $D_i$ is a discrete attribute, $C_j$ is a
continuous-valued attribute, m is the number of discrete attributes, and n is the
number of continuous-valued attributes.
Any discrete attribute has a finite set of values available. For example, if
there is a value set $\{d_{a,1}, d_{a,2}, ..., d_{a,p}\}$ for a discrete attribute $D_a$, we can
have a Boolean logic value for each discrete attribute, using the conditional
equation $D_a = d_{a,j}$, for j = 1, ..., p. We can put this state as a node in the first
hidden layer, and then one of the linear classifiers obtained with the neural network is
$L_k = \sum_{i}^{N_0} a_i W_{ik} = \sum_{i}^{n} c_i W_{ik} + \sum_{i}^{m} d_i W_{ik}$,
where $a_i$ is an instance of data in the form of the set Y, $c_i$ is an instance of
numeric attributes, and $d_i$ is an instance of discrete attributes. The second term
$\sum_{i}^{m} d_i W_{ik}$ can be treated as a threshold for the linear classifier $L_k$.
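To make the split of a linear classifier into a continuous part and a discrete part concrete, here is a minimal Python sketch. All names and numbers are hypothetical: a single discrete attribute with three values is expanded into Boolean indicators, and the weighted sum over those indicators plays the role of the threshold term.

```python
def one_hot(value, value_set):
    """Boolean indicator for each condition 'attribute == value'."""
    return [1 if value == v else 0 for v in value_set]


def linear_classifier_parts(continuous, discrete_bits, w_cont, w_disc):
    """Return (sum_i c_i * W_ik, sum_i d_i * W_ik) for one hidden node k."""
    cont_part = sum(c * w for c, w in zip(continuous, w_cont))
    disc_part = sum(d * w for d, w in zip(discrete_bits, w_disc))
    return cont_part, disc_part


# Hypothetical instance: two continuous attributes and one colour attribute.
bits = one_hot("green", ("red", "green", "blue"))
cont, disc = linear_classifier_parts(
    continuous=(2.0, 3.0),
    discrete_bits=bits,
    w_cont=(0.5, -1.0),         # made-up weights for the continuous inputs
    w_disc=(0.2, 0.7, -0.4),    # made-up weights for the indicator inputs
)
print(bits, cont, disc, cont + disc)
```

Because the discrete part is constant for a given combination of discrete values, it only shifts the boundary defined by the continuous part.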

Since we have no interest in the relation of discrete attributes whose numeric
conditions and coefficient values are not meaningful in this model, the discrete
attributes can be taken as variable thresholds. As a result, the value of the linear
classifier $L_k$ only depends on a linear combination of continuous attributes and
weights. The choice of discrete attributes in rules can be handled more properly by
induction tree algorithms, without interfering with the relations of continuous-valued
attributes.

Induction trees can split any continuous value by selecting thresholds for
given attributes, while they cannot derive the relation of input attributes directly. In
our method, we can add to the data set of the induction tree new attributes
$L_k = \sum_{i}^{n} c_i W_{ik}$ for k = 1, ..., r, where r is the number of nodes in the
first hidden layer for continuous-valued attributes. The new set of attributes for the
induction tree is $Y' = \{D_1, D_2, ..., D_m, C_1, C_2, ..., C_n, L_1, L_2, ..., L_r\}$. The
entropy measurement will try to find a significant classification over the new set of
attributes. We can have another attribute set which consists of only linear classifiers
generated by the neural network as follows:

$Y'' = \{D_1, D_2, ..., D_m, L_1, L_2, ..., L_r\}$

{C + L} linear classifiers, which include both the original input attributes and the
neural network linear classifiers, were tested with some data sets in the UCI
repository [5] and compared with the L-linear classifier method, which only includes
neural network linear classifiers. It is believed that a compact set of
attributes to represent the data set shows a better performance. Adding the original
input attributes does not improve the result, and it makes the performance worse
in most cases. C4.5 has difficulty in properly selecting the most significant
attributes for a given set of data, because it chooses attributes with a local
entropy measurement and the method is not a global optimization of entropy. Also,
using only linear classifiers from the neural network is quite
effective in reducing the number of rules.
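The construction of the two attribute sets, with and without the original continuous attributes, can be sketched as follows. This is an illustration only: the function names are hypothetical, and the weight columns reuse coefficients from the extracted rules as sample values rather than weights from any reported network.

```python
def linear_attributes(continuous, weight_columns):
    """Compute L_k = sum_i c_i * W_ik for each first-hidden-layer node k."""
    return [sum(c * w for c, w in zip(continuous, column))
            for column in weight_columns]


def augment(discrete, continuous, weight_columns, keep_original=True):
    """Build one row of Y' (keep_original=True) or Y'' (keep_original=False)."""
    linear = linear_attributes(continuous, weight_columns)
    if keep_original:
        return list(discrete) + list(continuous) + linear
    return list(discrete) + linear


# Hypothetical instance with one discrete and two continuous attributes.
weight_columns = [(1.44, 1.73), (1.18, 2.81)]   # one column per hidden node
row_cl = augment(["a"], [1.0, 2.0], weight_columns, keep_original=True)
row_l = augment(["a"], [1.0, 2.0], weight_columns, keep_original=False)
print(row_cl)  # discrete, continuous, then linear classifier attributes
print(row_l)   # discrete, then linear classifier attributes only
```

Rows built this way would then be handed to the induction tree in place of the raw attribute vectors.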

Generally, when we give many feature attributes to the induction rule
generator based on C4.5, performance tends to worsen. This is because
the induction tree is based on a locally optimal entropy search. In this paper, a
compact L-linear classifier method was tested. We compared L-linear classifiers
with fixed thresholds and variable thresholds; fixed thresholds are chosen by
the neural network and variable thresholds are selected by the induction tree algorithm.

In the first method, all instances in the training data are converted into Boolean
logic values and then applied to the induction tree algorithm C4.5. This method
uses the given thresholds determined by neural network training, and each hidden
node activation is taken as a Boolean logic value. It is equivalent to the logic circuit
minimization problem of finding a simple form of Boolean circuits. As a result,
it classifies data with newly constructed attributes. Decision trees can be seen
as a kind of heuristic method for logic circuit minimization. We collect a set
of Boolean logic instances depending on the activation of each node in the first
hidden layer of the neural network, and then it is given to the C4.5 induction
tree to construct logical functions over linear classifiers. C4.5 can also prune rules
over Boolean logic instances, even when it sees inconsistent class mappings from
input Boolean instances to output classes.

Another method is to use a set of linear classifiers as continuous-valued
attributes. The C4.5 application over instances of linear classifiers will try to find
the best splitting thresholds for discretization over each linear classifier attribute
to classify training data. In this case, each linear classifier attribute may have
multiple thresholds, different from the thresholds obtained in neural network
training. This will show the difference between neural networks and induction
trees in handling marginal boundaries of linear classifiers.
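The difference between the two variants can be sketched in Python. With fixed thresholds, a linear classifier value is turned into a Boolean attribute using the boundary learned by the network; with variable thresholds, the induction tree searches for its own split point on the same value. The sketch below is a simplification and not C4.5 itself: it scores splits by plain information gain (C4.5 uses the gain ratio), and all names and data are hypothetical.

```python
import math


def fixed_boolean(value, nn_threshold):
    """Fixed threshold: Boolean attribute from the network's own boundary."""
    return 1 if value > nn_threshold else 0


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_threshold(values, labels):
    """Variable threshold: split point with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best_gain:
            best_gain, best_t = base - remainder, t
    return best_t, best_gain


# Hypothetical values of one linear classifier attribute and their classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
print(best_threshold(values, labels))          # (2.5, 1.0)
print([fixed_boolean(v, 3.5) for v in values])  # a mis-placed fixed boundary
```

In this toy case a fixed boundary at 3.5 misclassifies one point, while the searched threshold separates the classes, which is the kind of marginal-boundary effect the text refers to.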



4 Experiments 

Our method has been tested on several sets of data in the UCI repository [2].
Fig. 2 and Table 1 show average classification error rates for neural networks
and the C4.5 algorithm, and Fig. 3 and Table 2 show error rates for our two
methods. We show the results of the linear classifier methods, the pure C4.5 method,
and neural networks. The error rates were estimated by running a complete
10-fold cross-validation ten times, and the average and the standard deviation
over the ten runs are given in the tables.
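The evaluation protocol, ten repetitions of 10-fold cross-validation summarized as a mean and standard deviation, can be sketched as below. This is an assumption-laden outline: `error_fn` is a hypothetical callback that trains on the training indices and returns the number of misclassified test examples, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
import random
import statistics


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cv_error(n, k, runs, error_fn, seed=0):
    """Mean and standard deviation (in %) of the error over repeated k-fold CV."""
    rng = random.Random(seed)
    run_errors = []
    for _ in range(runs):
        wrong = 0
        for test_idx in k_fold_indices(n, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            wrong += error_fn(train_idx, test_idx)
        run_errors.append(100.0 * wrong / n)
    return statistics.mean(run_errors), statistics.stdev(run_errors)


# Dummy callback for illustration: every tenth example counts as an error.
def dummy_error(train_idx, test_idx):
    return sum(1 for i in test_idx if i % 10 == 0)


print(repeated_cv_error(n=150, k=10, runs=10, error_fn=dummy_error))
```

A real `error_fn` would fit the classifier on `train_idx` and count its mistakes on `test_idx`; the dummy above only exercises the bookkeeping.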

Our methods using linear classifiers are better than C4.5 on some data sets and
worse on others, such as glass and pima, which are hard to predict even with a
neural network. The result supports the fact that the methods greatly depend on
neural network training. If neural network fitting is not correct, then the fitting
errors may mislead the result of the linear classifier methods. As Table 1 shows,
C4.5 generally has a higher error rate on the training data than the neural network.
The neural network can improve training performance by increasing the number of
nodes in the hidden layers, as shown in Table 1. However, this does not mean that
it improves test set performance. In many cases, reducing errors in a training
set tends to increase the error rate in a test set by overfitting.

The error rate difference between a neural network and linear classifiers
indicates that some data points are located on marginal boundaries of classifiers. It
is due to the fact that our neural network model uses sigmoid functions with high
slopes instead of step functions. When activation is near 0.5, the weighted sum
of activations may lead to different output classes. If the number of nodes in the
first hidden layer is increased, this marginal effect becomes larger, as observed
in Table 1 and Table 2. Fig. 3(b) and Table 3 show that the number of rules
using our method is significantly smaller than that using conventional C4.5 in all
the data sets. The number of rules for linear classifiers with the Boolean
circuit model greatly depends on the number of nodes in the first hidden layer.
The number of rules decreases when the number of nodes in the first hidden layer
decreases, while the error rate performance is similar within some limit,
regardless of the number of nodes. The linear classifier method with variable
thresholds also depends on the number of nodes. The reason why the number
of rules is proportional to the number of nodes is related to the search space
of Boolean logic circuits. The linear classifier method with the Boolean circuit
model often tends to generate rules that have a small number of support examples,
while the variable threshold model prunes rules by adjusting splitting thresholds
in the decision tree.

Fig. 2. Difference between C4.5 and neural network (a) average error rate in training
data (b) average error rate in test data

Table 1. Data classification error rate result in neural network and C4.5

                       neural network                             C4.5
data      pat / attr   train (%)     test (%)      nodes          train (%)     test (%)
wine      178 / 13     0 ± 0         1.8 ± 0.6     13-8-5-3       1.2 ± 0.1     6.6 ± 1.2
          178 / 13     0 ± 0         1.9 ± 0.6     13-5-5-3       1.2 ± 0.1     6.6 ± 1.2
iris      150 / 4      0.6 ± 0.1     4.1 ± 1.3     4-5-4-3        1.9 ± 0.1     5.4 ± 0.7
breastw   683 / 9      0.3 ± 0.1     4.9 ± 0.7     9-8-5-3        1.1 ± 0.1     4.7 ± 0.5
ion       351 / 34     1.3 ± 0.2     8.8 ± 1.3     34-10-7-2      1.6 ± 0.2     10.4 ± 1.1
pima      768 / 8      4.9 ± 0.4     27.8 ± 0.7    8-15-9-2       15.1 ± 0.8    26.4 ± 0.9
          768 / 8      8.8 ± 0.7     27.8 ± 0.7    8-10-7-2       15.1 ± 0.8    26.4 ± 0.9
          768 / 8      12.7 ± 0.4    27.1 ± 0.9    8-7-7-2        15.1 ± 0.8    26.4 ± 0.9
glass     214 / 9      2.3 ± 0.4     32.1 ± 2.6    9-15-12-7      6.7 ± 0.4     32.0 ± 1.5
          214 / 9      4.3 ± 0.9     31.6 ± 1.2    9-15-8-7       6.7 ± 0.4     32.0 ± 1.5
          214 / 9      5.0 ± 0.7     32.1 ± 1.9    9-10-8-7       6.7 ± 0.4     32.0 ± 1.5
bupa      345 / 6      7.3 ± 1.4     32.7 ± 2.2    6-10-7-2       13.1 ± 0.7    34.5 ± 1.8
          345 / 6      9.3 ± 0.7     32.1 ± 1.9    6-8-6-2        13.1 ± 0.7    34.5 ± 1.8
          345 / 6      15.2 ± 0.9    32.8 ± 1.9    6-5-5-2        13.1 ± 0.7    34.5 ± 1.8

Most of the data sets in the UCI repository have a small number of data
examples relative to the number of attributes. Unlike the synthetic data in Fig. 1,
the UCI data do not show a clear difference in error rate between
a simple C4.5 application and the combination of C4.5 and a neural
network. Information about data trends or input relations can be
described more definitely when many data examples are given relative to the number of
attributes.

Tables 1 and 2 show that neural network classification is better than the linear
classifier applications. Even though the linear classifier methods are good
approximations to nonlinear neural network modeling in the experiments, we still need
to reduce the gap between neural network training and linear classifier models.
It may also be necessary to prove theoretically that linear classifiers with variable
thresholds are a close approximation to neural network modeling. There
is a trade-off between the number of rules and error rate performance. Determining
the optimal number of rules for a given data set is left for future
study.

Fig. 3. Difference between C4.5 and linear classifier methods (a) test data average
error rate for C4.5, linear classifiers with fixed and variable thresholds (b) the number
of rules for C4.5, linear classifiers with fixed and variable thresholds

Fig. 4. Linear classifier methods with several neural networks (a) test data average
error rate for C4.5 and linear classifiers with 3 different neural networks (see Table 2)
(b) the number of rules for C4.5 and linear classifiers with 3 different neural networks
(see Table 3)



Rule Reduction over Numerical Attributes in Decision Trees 






Table 2. Data classification error result in our method using linear classifiers

                          variable thresholds           fixed thresholds
data         nodes        train (%)     test (%)        train (%)     test (%)
wine         13-8-5-3     0.2 ± 0.1     4.2 ± 1.2       0.3 ± 0.1     3.6 ± 1.3
             13-5-5-3     0.0 ± 0.1     3.1 ± 0.6       0.2 ± 0.2     3.2 ± 0.8
iris         4-5-4-3      0.7 ± 0.2     4.7 ± 1.5       2.0 ± 0.4     5.3 ± 1.2
breast-w     9-8-5-3      0.9 ± 0.2     4.4 ± 0.3       2.2 ± 0.3     4.5 ± 0.3
ionosphere   34-10-7-2    1.2 ± 0.2     8.8 ± 1.5       1.6 ± 0.2     9.2 ± 1.6
pima         8-15-9-2     13.4 ± 0.8    27.6 ± 1.3      10.4 ± 0.4    28.5 ± 0.7
             8-10-7-2     14.6 ± 1.0    26.9 ± 1.1      13.5 ± 0.5    27.7 ± 1.2
             8-7-7-2      15.8 ± 0.4    27.0 ± 0.9      16.4 ± 0.4    26.6 ± 0.9
glass        9-15-12-7    6.7 ± 0.6     36.6 ± 2.7      12.2 ± 0.5    34.0 ± 2.4
             9-15-8-7     6.6 ± 0.8     34.9 ± 1.6      12.0 ± 0.6    33.6 ± 3.1
             9-10-8-7     7.6 ± 1.1     36.0 ± 2.5      13.8 ± 0.5    34.4 ± 2.3
bupa         6-10-7-2     15.2 ± 1.3    32.7 ± 2.9      17.3 ± 0.7    32.7 ± 1.6
             6-8-6-2      13.6 ± 1.5    32.5 ± 1.7      15.1 ± 0.6    34.0 ± 2.4
             6-5-5-2      17.5 ± 1.0    33.6 ± 2.3      22.2 ± 1.1    34.3 ± 2.3

Table 3. Number of rules for each method

data         nodes        C4.5          variable T     fixed T
wine         13-8-5-3     5.5 ± 0.3     3.1 ± 0.1      3.4 ± 0.3
             13-5-5-3     5.5 ± 0.3     3.0 ± 0.0      3.2 ± 0.1
iris         4-5-4-3      4.9 ± 0.1     4.0 ± 0.4      4.1 ± 0.2
breast-w     9-8-5-3      18.6 ± 0.7    7.8 ± 0.8      7.9 ± 0.9
ionosphere   34-10-7-2    14.0 ± 0.6    6.2 ± 0.8      5.1 ± 0.3
pima         8-15-9-2     26.7 ± 2.3    23.4 ± 3.0     58.9 ± 3.0
             8-10-7-2     26.7 ± 2.3    18.1 ± 2.4     33.9 ± 2.8
             8-7-7-2      26.7 ± 2.3    13.9 ± 1.0     19.1 ± 1.0
glass        9-15-12-7    26.1 ± 0.9    23.5 ± 1.0     26.1 ± 0.9
             9-15-8-7     26.1 ± 0.9    22.9 ± 0.8     25.0 ± 1.3
             9-10-8-7     26.1 ± 0.9    23.1 ± 0.7     21.9 ± 1.0
bupa         6-10-7-2     29.3 ± 1.4    16.2 ± 1.7     24.4 ± 2.3
             6-8-6-2      29.3 ± 1.4    14.3 ± 2.1     19.5 ± 1.3
             6-5-5-2      29.3 ± 1.4    10.6 ± 1.3     9.4 ± 0.9

5 Conclusions 

This paper presents a hybrid method for constructing a decision tree from neural
networks. Our method uses neural network modeling to find unseen data points,
and then an induction tree is applied to the data points to obtain symbolic rules, using
features from the neural network. The combination of neural networks and
induction trees compensates for the disadvantages of either approach alone. This
method has advantages over a simple decision tree method. First, we can
obtain good features for a classification boundary from neural networks by training
on input patterns. Second, because features capturing input variable
relations are extracted, we can obtain a compact set of rules that reflects the input patterns.
We still have much work ahead, such as applying the minimum description
length principle to reduce the number of rules and the error rate, and finding the
optimal number of linear classifiers.

Abstract. The knowledge acquisition method "Ripple Down Rules" can
directly acquire and encode knowledge from human experts. It is an
incremental acquisition method, and each new piece of knowledge is added as
an exception to the existing knowledge base. This knowledge base takes
the form of a binary tree. There is another type of knowledge acquisition
method that learns directly from data. Induction of decision trees is one
representative example. Since more and more data are stored in
databases in this digital era, using both human expertise and these
stored data becomes even more important. In this paper, we attempt
to integrate inductive learning and knowledge acquisition. We show that,
using the minimum description length principle, the knowledge base of
Ripple Down Rules can be automatically and incrementally constructed from
data, thus making it possible to switch between manual acquisition
by a human expert and automatic induction from data at any point of
knowledge acquisition. Experiments are carefully designed and carried out to
verify that the proposed method indeed works for many data sets of
different natures.



1 Introduction 

We pay attention to the Ripple Down Rules Method (RDR) as a promising
approach to constructing a knowledge-based system in an environment in which
rapid innovation in technology makes existing knowledge out of date in a very
short time and requires frequent updates. In RDR, Knowledge Acquisition (KA)
is regarded as a continuous refinement of existing knowledge. It is
interactive, and there is no distinction between knowledge acquisition and
maintenance.

Since RDR is primarily a method to capture knowledge from a human expert, it
relies heavily on the human expert’s judgment. Although it is known that a
human expert is good at explaining why a particular instance is misclassified
and justifying what kind of remedy needs to be made, she is by no means
infallible. Humans make mistakes. Recent advances in machine learning make it
possible to induce a classifier from data quite efficiently. Further, it is
often the case that a large quantity of data has already been stored in
databases. There is no reason not to use these data in building
knowledge-based systems.

In this paper, we explore the possibility of integrating the inductive
learning method and the standard RDR method in a unified way. We propose to
use the Minimum Description Length Principle (MDLP) as the underlying
principle for this integration. Inducing the RDR knowledge base from data has
been studied by Gaines and Compton. They used Induct, in which the basic
algorithm is to search for the premise for a given conclusion that is least
likely to predict that conclusion by chance. Their method, although it
produces the RDR knowledge base, does not seem to have a common strategy by
which a knowledge acquisition approach and a machine learning approach are
integrated.

We examine the proposed methods against 25 benchmark data sets and show that
they work as expected. This suggests that it is feasible to develop a
flexible knowledge-based system: the knowledge base is constructed by a human
expert at an earlier stage of the development, when few data are available,
and is then refined by an induction technique at a later stage, when enough
data are available.

2 Ripple Down Rules Revisited 

The basis of RDR is the maintenance and retrieval of cases. When a case is 
incorrectly retrieved by an RDR system, the KA (maintenance) process requires 
the expert to identify how a case stored in a knowledge-based system differs from 
the present case. 



[Figure: (a) Knowledge structure of RDR; (b) Knowledge acquisition in RDR]

Fig. 1. Knowledge structure of the Ripple Down Rules Method



The tree structure of an RDR knowledge base is shown in Fig. 1(a). Each
node in the binary tree is a rule with a desired conclusion. Each node has a
“cornerstone case” associated with it, that is, the case that prompted the
inclusion of the rule. An inference process for an incoming case starts from
the root node of the binary tree and continues until there is no branch to
move on. The conclusion for the case is the conclusion part of the “last
satisfied node”. If the class is different from the class which a human
expert judges the case to be, knowledge (a new rule) is acquired from the
human expert. The KA process in RDR is illustrated in Fig. 1(b). When the
expert wants to add a new rule, there must be a case that is misclassified by
a rule in RDR. The system asks the expert to select conditions from the
“difference list” between the misclassified case and the cornerstone case.
Then the misclassified case is stored as the refinement case (new cornerstone
case) with the new rule whose condition part distinguishes these cases.
Depending on whether the last satisfied node is the same as the end node, the
new rule and its cornerstone case are added at the end of the YES or NO
branch of the end node. Knowledge is never removed or changed, simply
modified by the addition of exception rules. The tree structure of the
knowledge and the keeping of cornerstone cases ensure that the knowledge is
always used in the same context in which it was added.



3 The Minimum Description Length Principle 

Occam’s razor, or minimizing the description length, is the normal practice for 
selecting the most plausible one of many alternatives. Occam’s razor prefers 
the simplest hypothesis that explains the data. Simplicity of hypothesis can be 
measured by description length (DL), originally proposed by Rissanen. DL
can express the complexity of specifying a hypothesis, and the value of DL is
calculated as the sum of two encoding costs: one for the hypothesis and the
other for cases misclassified with the hypothesis. The MDLP has been used as a
criterion to select a good model in machine learning, e.g. in decision trees,
neural networks and Bayesian networks.



Table 1. Examples of cases

ID No.   Att. Swim   Att. Breath   Att. Legs   Class
1        can         lung          2legs       Dog
2        can         lung          4legs       Penguin
3        can         skin          2legs       Monkey
4        can         skin          4legs       Dog
5        can_not     lung          2legs       Dog
6        can_not     lung          4legs       Monkey
7        can_not     gill          2legs       Penguin
8        can_not     gill          4legs       Dog
9        can_not     skin          2legs       Dog
10       can_not     skin          4legs       Monkey

We illustrate the concept of the MDLP, using a communication problem. Let 
us suppose that both a sender A and a receiver B have the same list of Table 1
except that B does not know the class information. A communication problem 
is to send the class information from A to B through a communication path 
with as few bits as possible. Knowledge-base models such as decision trees or 
binary trees in RDR can be thought to be composed of a splitting method and 
a representative class for each split subset. Thus, the calculation of DL consists 
of the following 4 steps.

Step 1 : Split cases in the list into several subsets according to the attribute
values on the basis of some splitting method (a model). 

Step 2 : Encode the model on the basis of a certain coding method, and send 
the resultant bit sequence to B. 

Step 3 : Encode the representative class of each subset, and send the respective 
bit sequence to B. 

Step 4 : Encode the true class of each misclassified case, which is different from 
the representative class, and send the respective bit sequence to B. 

If the splitting method used in Step 1 is complex, the DLs in Step 2 and
Step 3 are most likely large, and the DL in Step 4 is small. Conversely, the
DL in Step 4 will be larger if the splitting method is simpler. Therefore,
there is a trade-off between 1) the DL for the splitting method (Step 2) plus
the DL for the representative classes (Step 3) and 2) the DL for the class
information of the misclassified cases (Step 4). The total DL (the DL for a
knowledge-base model + the DL for misclassified cases in the subsets) can be
calculated by assuming a certain encoding method. The MDLP says that the
knowledge-base model with the smallest total DL predicts the classes of unseen
cases well. In the case of RDR, this means that the binary tree with the
smallest total DL has the lowest error rate for unseen cases.
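
The trade-off can be made concrete on the cases of Table 1. The following is a minimal sketch, not the authors' encoding: it only counts how many subsets a split produces (a rough proxy for the cost of Steps 2 and 3) and how many cases remain misclassified (what Step 4 must encode). Taking the majority class as the representative class and the helper name `split_summary` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Table 1 as printed: (Swim, Breath, Legs, Class) for case IDs 1..10.
CASES = [
    ("can", "lung", "2legs", "Dog"),
    ("can", "lung", "4legs", "Penguin"),
    ("can", "skin", "2legs", "Monkey"),
    ("can", "skin", "4legs", "Dog"),
    ("can_not", "lung", "2legs", "Dog"),
    ("can_not", "lung", "4legs", "Monkey"),
    ("can_not", "gill", "2legs", "Penguin"),
    ("can_not", "gill", "4legs", "Dog"),
    ("can_not", "skin", "2legs", "Dog"),
    ("can_not", "skin", "4legs", "Monkey"),
]


def split_summary(cases, attr_indices):
    """Return (number of subsets, number of misclassified cases) for a split."""
    subsets = defaultdict(list)
    for case in cases:
        # Step 1: group cases by their values on the chosen attributes.
        key = tuple(case[i] for i in attr_indices)
        subsets[key].append(case[-1])
    misclassified = 0
    for classes in subsets.values():
        # Step 3 (assumption): the representative class is the majority class.
        _, majority_count = Counter(classes).most_common(1)[0]
        # Step 4 has to encode the true class of every remaining case.
        misclassified += len(classes) - majority_count
    return len(subsets), misclassified


for name, idx in [("no split", ()), ("Swim", (0,)), ("all attributes", (0, 1, 2))]:
    print(name, split_summary(CASES, idx))
```

Splitting on all three attributes leaves no misclassified case but needs ten subsets, whereas no split at all needs one subset and leaves five cases to be corrected; the MDLP looks for the balance between these extremes.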

4 Calculation of the Total DL 

In this section, we briefly explain how we encode the DL. Note that this is not the 
only way to encode the DL. There are other ways to do so, but the experimental 
results show that our encoding method is reasonable. 



[Figure: an RDR binary tree with the following annotations]

Class Info.: Dog, Penguin, Monkey

Attribute Info.:
  Swim:   can, can_not
  Breath: lung, gill, skin
  Legs:   2legs, 4legs

Cases for which node No. 1 is the last satisfied node:
  7   can_not, gill, 2legs: Dog
  8   can_not, gill, 4legs: Monkey
  9   can_not, skin, 2legs: Penguin
  10  can_not, skin, 4legs: Penguin

Attribute-space of node No. 3:
  Swim:   can(4), can_not(0)
  Breath: lung(2), gill(0), skin(2)
  Legs:   2legs(2), 4legs(2)

Cases which go through node No. 3 in the inference process:
  1   can, lung, 2legs: Dog
  2   can, lung, 4legs: Penguin
  3   can, skin, 2legs: Monkey
  4   can, skin, 4legs: Dog

Fig. 2. An example knowledge base to calculate the DL

The total DL is the sum of the DL for the binary tree and the DL for cases
misclassified by the tree. We explain how to calculate the DL of a binary
tree in subsection 4.1 and that of the cases misclassified by this tree in
subsection 4.2. The explanation is general, but we instantiate it using the
RDR knowledge base given in Fig. 2 and the set of cases given in Table 1.

4.1 The DL of the Binary Tree

First, inference is made for all the cases in the list to obtain the
attribute-space of each node in the binary tree. Assume that k cases go
through a node C in the tree. From those k cases, we obtain the frequency
distribution of each attribute-value in the node C. The corresponding
attribute-space consists of the set of attributes that each have at least 2
attribute-values with a frequency of at least 1 case. The attribute-space of
node No. 3 is {Att.Breath: lung, skin; Att.Legs: 2legs, 4legs}, which is
depicted in the lower left-hand side of Fig. 2.

There are two kinds of information to be encoded for the node C: one for
the branches below the node and the other for the If-Then rule that is stored
in the node as knowledge itself. The branch-information of node No. 3 is
{YES-branch: no, NO-branch: no} and the rule-information of the node is
{If Legs=4legs then Dog}. There are four branch-combinations for all nodes
except for the root node, that is, the number of candidates is 4. Therefore,
$\log_2 {}_4C_1$ bits are necessary to encode this information for the node C.
The DL for the branch-information of node No. 3 is also $\log_2 {}_4C_1$ bits.
The information needed to describe the If-Then rule consists of 4 components:
(1) {the number of attributes used in the condition part}, (2) {the attributes
used in the condition part}, (3) {the attribute-value for each used attribute}
and (4) {the class used in the conclusion part}. In the case of node No. 3,
these are (1) {the number of attributes: 1}, (2) {attributes: Legs}, (3) {the
value of Att.Legs: 4legs} and (4) {the class: Dog}. $\log_2 {}_nC_1$ bits are
necessary to encode the information of (1) because there are n candidates,
{1, 2, ..., n}, where n is the number of attributes in the attribute-space. In
the case of node No. 3 it is $\log_2 {}_2C_1$ because the attribute-space has
two attributes. Let the number of attributes used in the condition part be t.
The information of (2) can be encoded by $\log_2 {}_nC_t$ bits because the
number of combinations of having t 1's and n - t 0's is given by ${}_nC_t$. In
the case of node No. 3, it is $\log_2 {}_2C_1$ bits. Next,
$\log_2 {}_{m_i}C_1 + \log_2 {}_2C_1$ bits are necessary for each used
attribute to encode the information of (3), where $m_i$ is the number of
attribute-values of the used attribute in the attribute-space. The second term
is the DL necessary to specify whether the value is negated or not. In the
case of node No. 3, it is $\log_2 {}_2C_1 + \log_2 {}_2C_1$ bits because the
attribute Legs is the only one used. Finally, $\log_2 {}_{class\_num-1}C_1$
bits are necessary to encode the information of (4) because the number of
classes that can be used as the conclusion in each node except for the root
node is class_num - 1. Here, class_num is the number of classes in the problem
domain. In the case of node No. 3, it is $\log_2 {}_2C_1$ bits because the
candidates are Dog and Monkey. The sum of the DLs for the information of (1),
(2), (3) and (4) is the DL necessary to encode the If-Then rule in the node C.

The sum of the DL for the branch-information, which was mentioned at the
beginning, and the DL for the rule-information is the DL for the node C. In
the case of node No. 3, it is $5 \log_2 {}_2C_1 + \log_2 {}_4C_1$ bits. If the
sender A encodes the information of all nodes in the tree and sends it to the
receiver B, B can decode it and obtain the splitting method used and the
representative classes (the binary tree itself).
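
The node-level calculation above can be sketched in a few lines of Python. This is an illustrative sketch for non-root nodes only (the text does not spell out the root-node case); the function names `branch_dl` and `rule_dl` and the dictionary representation of the attribute-space are assumptions, not part of the original method description.

```python
from math import comb, log2


def branch_dl():
    """Branch-information of a non-root node: 4 candidate branch-combinations."""
    return log2(comb(4, 1))


def rule_dl(attr_space, used_attrs, class_num):
    """DL in bits of the If-Then rule of a non-root node, components (1)-(4).

    attr_space: dict mapping each attribute in the node's attribute-space
                to its number of attribute-values.
    used_attrs: attributes that appear in the condition part.
    class_num:  number of classes in the problem domain.
    """
    n, t = len(attr_space), len(used_attrs)
    dl = log2(comb(n, 1))               # (1) number of attributes used
    dl += log2(comb(n, t))              # (2) which attributes are used
    for attr in used_attrs:             # (3) value plus negation flag per attribute
        dl += log2(comb(attr_space[attr], 1)) + log2(comb(2, 1))
    dl += log2(comb(class_num - 1, 1))  # (4) class of the conclusion
    return dl


# Node No. 3 of Fig. 2: attribute-space {Breath: lung, skin; Legs: 2legs, 4legs},
# rule "If Legs=4legs then Dog", three classes in the problem domain.
node3_dl = rule_dl({"Breath": 2, "Legs": 2}, ["Legs"], class_num=3) + branch_dl()
print(node3_dl)  # 5 * log2(2) + log2(4) = 7.0 bits
```

For node No. 3 the rule costs five one-bit choices and the branch-information costs two bits, which agrees with the expression given in the text.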
4.2 The DL for Misclassified Cases

We next explain how to calculate the DL that is necessary to encode the true
class information for the cases misclassified by the given binary tree. This DL
can be calculated at each node in the tree.

Assume that there is a node D in the binary tree which is the last satisfied
node for r cases as a result of running all cases in the list. That is, these
r cases form a subset that has the conclusion of the node D as its
representative class. In the case of node No. 1 in Fig. 2 they are
{7, 8, 9, 10}. Further assume that k cases out of r are misclassified. In the
case of node No. 1, 3 cases out of 4 are misclassified and have classes
different from the representative class, Dog. First, it is necessary to encode
the number of classes that are different from the representative one in the r
cases. The DL is $\log_2 {}_{class\_num}C_1$ because the number of candidates
is class_num, that is, {0, 1, ..., class_num - 1}. In the case of node No. 1
it is $\log_2 {}_3C_1$ bits. Next, we calculate the DL which is necessary to
specify which cases have which classes. Let the number of cases with the i-th
different class be $p_i$ ($i = 1, 2, \ldots, s$), where s is the number of
classes different from the representative one. The different classes are
ordered to satisfy $p_s \geq p_{s-1} \geq \cdots \geq p_2 \geq p_1$. In the
case of node No. 1, $p_2 = 2$ (Penguin) and $p_1 = 1$ (Monkey). With this
preparation, the DL is calculated by the algorithm shown in Fig. 3.

function DescriptionLength: real;
variable dl: real;
         case, i: integer;
begin
    dl := 0;
    case := r;                              # number of cases
    i := s;                                 # number of different classes
    repeat
        dl := dl + log2(i);                 # specifying the i-th class
        if root node then
        begin
            dl := dl + log2(case - i + 1);  # specifying the value of p_i
        end
        else
        begin
            dl := dl + log2(case - i);      # specifying the value of p_i
        end
        dl := dl + log2(caseC(p_i));        # specifying p_i cases (caseC(p_i): combinations)
        case := case - p_i;                 # number of remaining cases
        i := i - 1;                         # number of remaining classes
    until i = 0;
    DescriptionLength := dl;
end

Fig. 3. Algorithm to calculate the DL to specify p_i cases (i = 1, 2, ..., s)

It is now possible to find the true classes of the k cases by decoding the
encoded signal, which is “DescriptionLength” bits long. All the remaining
r - k cases have the representative class. In the case of node No. 1, the 8th
case is Monkey, and the 9th and the 10th cases are Penguin. If the sender A
encodes the information for the misclassified cases in each subset, the
receiver B can get the true classes of the cases misclassified by the tree.
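
For readers who prefer runnable code, the algorithm of Fig. 3 can be transcribed into Python roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, a `while` loop replaces `repeat ... until` so that a node with no misclassified case (s = 0) costs nothing, and node No. 1 is treated as a non-root node purely for the sake of the example.

```python
from math import comb, log2


def fig3_description_length(r, p, is_root):
    """DL in bits to specify which of the r cases carry which different class.

    r:       number of cases whose last satisfied node is this node.
    p:       counts [p_1, ..., p_s] of the s different classes, ascending.
    is_root: True for the root node, which widens the range of p_i by one.
    """
    dl = 0.0
    case = r          # number of remaining cases
    i = len(p)        # number of remaining different classes
    while i > 0:
        dl += log2(i)                                             # the i-th class
        dl += log2(case - i + 1) if is_root else log2(case - i)   # value of p_i
        dl += log2(comb(case, p[i - 1]))                          # which p_i cases
        case -= p[i - 1]
        i -= 1
    return dl


def misclassified_dl(r, p, class_num, is_root=False):
    """Add the cost of sending s, the number of different classes."""
    return log2(comb(class_num, 1)) + fig3_description_length(r, p, is_root)


# Node No. 1 of Fig. 2: r = 4, p_1 = 1 (Monkey), p_2 = 2 (Penguin), 3 classes.
# Treating it as a non-root node is an assumption made for this example.
print(misclassified_dl(4, [1, 2], class_num=3, is_root=False))
```

The inputs are expected to be consistent with the text, i.e. a non-root node always keeps at least one case of its representative class, so the argument of every logarithm stays positive.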
4.3 The Total DL

Finally, by sending the encoded signal with the total DL (the sum of the DLs
mentioned in subsection 4.1 and subsection 4.2), the receiver B is able to
find the class information of all cases in the list.

It is empirically known that many encoding methods used for the construction
of knowledge-based systems based on the MDLP, including the one mentioned
here, tend to overestimate the DL necessary to encode the knowledge base,
compared with the DL necessary to encode the true classes of the misclassified
cases. Therefore, it is common to use a weighted sum of the two:

$$\text{the total DL} = (\text{the DL for classes of misclassified cases}) + W \times (\text{the DL for the knowledge base}) \qquad (1)$$

Here, W is a coefficient which is less than 1, and in this paper we
empirically found that 0.3 is a good value for W.
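
The effect of the weight W in equation (1) can be illustrated with a small sketch. The two candidate trees and their description lengths below are hypothetical numbers chosen only to show how a weight below 1 can change which model has the smallest total DL; the function names are likewise assumptions.

```python
def total_dl(dl_misclassified, dl_knowledge_base, w=0.3):
    """Equation (1): weighted sum of the two description lengths, in bits."""
    return dl_misclassified + w * dl_knowledge_base


# Hypothetical candidates: name -> (DL for misclassified cases, DL for knowledge base).
candidates = {"small tree": (40.0, 10.0), "large tree": (30.0, 30.0)}


def best(candidates, w):
    """Name of the candidate with the smallest weighted total DL."""
    return min(candidates, key=lambda name: total_dl(*candidates[name], w=w))


print(best(candidates, w=1.0))  # unweighted sum
print(best(candidates, w=0.3))  # weight reported in the text
```

With W = 1 the smaller tree wins in this toy setting, while W = 0.3 discounts the cost of the knowledge base and lets the larger tree win.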

5 The MDLP-Based Knowledge Acquisition Methods 

We propose two kinds of knowledge acquisition methods that are based on the 
MDLP in RDR. One is for constructing a binary tree by using data alone and 
the other is for constructing the tree by using both data and human experts. 

5.1 A Method That Uses Data Alone 

In the standard RDR, a set of elements selected from the difference list by
a human expert becomes the condition part of a new node to be added to the
binary tree grown so far. Based on the MDLP, however, we want to select, among
the possible sets of elements in the difference list, the set that gives the
minimum total DL for the whole tree. Let us assume that a problem domain has n
attributes $\{A_i \mid i = 1, \ldots, n\}$ and that each attribute $A_i$ takes
attribute-values $v_{i,j}$.

Let a current case misclassified by the RDR tree grown so far be defined as
case A: $\{v_{1,a}, v_{2,a}, \ldots, v_{n,a}\}$ $(v_{i,a} \in A_i)$, and a
cornerstone case whose node has derived the false conclusion (the last
satisfied node) be defined as case B:
$\{v_{1,b}, v_{2,b}, \ldots, v_{n,b}\}$ $(v_{i,b} \in A_i)$. Figure 4 is an
example in which the case A is $\{v_{1,2}, v_{2,1}, v_{3,2}\}$ and the case B
is $\{v_{1,1}, v_{2,1}, v_{3,1}\}$. Details of the search algorithm are
omitted due to space limitations, but it is a greedy search, as shown in
Fig. 4. The search starts with a condition which is specialized to case A ^1
and, while expanding its search space, finds a condition that falls in a
(local) minimum of the total DL. This method makes it possible to construct an
RDR knowledge base by using data alone ^2, without the help of human experts.

^1 It is natural to start with this condition because this is the only
evidence that is against the cornerstone case. Further note that any choice of
an element in the difference list differentiates between the cornerstone case
and the misclassified case.

^2 The implicit assumption is that the data are labeled.

[Figure: search space over conditions built from the difference list]

misclassified case:  v1,2   v2,1   v3,2   class: P
cornerstone case:    v1,1   v2,1   v3,1   class: N

difference list = {v1,2 , not(v1,1) , v3,2 , not(v3,1)}

Fig. 4. Search space that uses data alone
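
Since the details of the search are omitted in the text, the following is only a sketch of one possible greedy search, not the authors' algorithm. It builds the difference list as in Fig. 4 and then hill-climbs over conditions until no neighbouring condition has a smaller cost. The move set (drop or add one element), the starting condition, and the cost function `toy_dl` are all assumptions made for illustration; in the method described here the cost would be the total DL of the whole tree.

```python
def difference_list(case_a, case_b):
    """Elements that hold for the misclassified case A but not for cornerstone B.

    Each element is (attribute index, value index, positive?), so that
    (1, 2, True) stands for v1,2 and (1, 1, False) stands for not(v1,1).
    """
    diff = []
    for attr, (va, vb) in enumerate(zip(case_a, case_b), start=1):
        if va != vb:
            diff.append((attr, va, True))    # A's own value
            diff.append((attr, vb, False))   # negation of B's value
    return diff


def greedy_search(start, candidates, cost):
    """Hill-climb over conditions (frozensets of elements) to a local minimum."""
    current = frozenset(start)
    current_cost = cost(current)
    while True:
        # Assumed moves: drop one element, or add one unused element.
        neighbours = [current - {e} for e in current if len(current) > 1]
        neighbours += [current | {e} for e in candidates if e not in current]
        if not neighbours:
            return current, current_cost
        best = min(neighbours, key=cost)
        if cost(best) >= current_cost:
            return current, current_cost     # local minimum reached
        current, current_cost = best, cost(best)


# Fig. 4: case A = (v1,2, v2,1, v3,2), cornerstone case B = (v1,1, v2,1, v3,1).
diff = difference_list((2, 1, 2), (1, 1, 1))
start = [e for e in diff if e[2]]            # assumed start: A's own values


def toy_dl(condition):
    """Hypothetical stand-in for the total DL of the tree with this condition."""
    return abs(len(condition) - 1) + (0 if (1, 2, True) in condition else 0.5)


print(diff)
print(greedy_search(start, diff, toy_dl))
```

The search terminates because the cost strictly decreases at every step and the set of conditions is finite.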



5.2 A Method That Uses Both Data and Human Experts 

The greedy search described in subsection 5.1 starts with a condition
specialized to a current misclassified case, and this is not necessarily a
good starting point. Another possibility is to start with the condition that
an expert has selected, if such advice is available, as in the case of the
standard RDR. The advantage of this method is that it can find a better
condition even when the expert fails to select the best conditions from the
difference list; in general, the better the expert’s guess is, the smaller
the DL found is.

6 Experiments 

Before examining the effectiveness of the methods proposed in the previous
section, we need to ascertain that the MDLP holds for binary trees of the
standard RDR. After we confirm this, we examine whether the method proposed in
subsection 5.2 can indeed construct more accurate binary trees than the
standard RDR binary trees, and whether the method proposed in subsection 5.1
can construct binary trees which are as accurate as the induction rule sets
obtained by C4.5, which we select as a standard machine learning method.

Databases used for experiments: We have selected 24 databases from the
University of California Irvine Data Repository and 1 database from the
University of Toronto Data Repository. Of these, 13 databases have only
nominal attributes, 9 databases only numerical attributes, and 3 databases
mixed attributes. We discretized the numeric attributes beforehand. For each
database that has numeric attributes, we ran C4.5 to build a decision tree
using the whole data and used the same discretization thresholds that C4.5
found by the information gain ratio criterion.

Simulated expert: We use a Simulated Expert (SE) (machine generated expert)
instead of a human expert for the reproducibility of experiments and
consistent performance estimation. The SE has a set of If-Then rules derived
from a decision tree constructed by C4.5 using the whole training data
(explained in subsection 6.1). The set of elements selected from the
difference list by the SE is defined as the intersection between the list and
the condition part of the If-Then rule in the SE which correctly predicts the
case misclassified by the RDR system.
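
The selection rule of the Simulated Expert amounts to a set intersection. A minimal sketch follows; the representation of conditions as (attribute, value) pairs, the function name `se_select`, and the example rule are hypothetical.

```python
def se_select(difference_list, se_rule_condition):
    """Elements the Simulated Expert picks: difference list ∩ rule condition.

    The order of the difference list is preserved in the result.
    """
    rule = set(se_rule_condition)
    return [element for element in difference_list if element in rule]


# Hypothetical example with (attribute, value) pairs as elements.
diff = [("Swim", "can_not"), ("Legs", "4legs")]
# Condition part of the SE rule that classifies the misclassified case correctly.
rule_condition = [("Legs", "4legs"), ("Breath", "gill")]

print(se_select(diff, rule_condition))
```

Only conditions that both distinguish the two cases and appear in the SE rule are handed to the RDR system.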



6.1 Experiment 1 to Examine Whether the MDLP Holds for the 
Standard RDR 



We randomly selected 75% of the cases from a database as a “training data
set” and used them as incoming cases to the RDR system, and used the remaining
25% of the cases as a “test data set” for testing the accuracy of binary
trees. The SE is constructed using the identical training data set. The same
set of 75% of the cases is also used as the data set for calculating total DLs
for binary trees. Then, using both the SE and the training data set, a binary
tree is constructed in the standard RDR. We plot the total DL and the number
of misclassified cases of the test data each time a new case comes in and a
new node is added. Because the results vary and depend on the order of
incoming cases, we randomly generate 10 different orderings and plot all of
the 10 trials in the same two-dimensional plane.
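
The data handling of this experiment can be sketched as follows. The function name, the fixed `seed`, and the use of Python's `random` module are assumptions for illustration; the text does not say how the random selection was implemented.

```python
import random


def split_and_orderings(cases, train_fraction=0.75, n_orderings=10, seed=0):
    """Random train/test split plus several random orderings of the training set."""
    rng = random.Random(seed)      # seed fixed only to make the sketch repeatable
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    orderings = []
    for _ in range(n_orderings):
        order = train[:]
        rng.shuffle(order)         # one ordering of incoming cases per trial
        orderings.append(order)
    return train, test, orderings


train, test, orderings = split_and_orderings(list(range(100)))
print(len(train), len(test), len(orderings))
```

Each ordering would then be fed to the RDR system case by case, recording the total DL and the number of misclassified test cases after every added node.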







Fig. 5. Result for the database “Car Evaluation” for the default class “unacceptable” 



A typical result is given in Fig. 5. This is the result obtained for the
database “Car Evaluation”. The horizontal axis is the value of the total DL,
and the vertical one is the number of misclassified cases. We see from this
figure that binary trees with a smaller total DL misclassify fewer cases of
the test data set. This tendency is seen in many other databases.

6.2 Experiment 2 to Compare the Effect of Using Both Data and
Human Expert with the Standard RDR

75% of the cases are selected from a database for a training data set, and
the remaining 25% of the cases are used for a test data set. However, we use
only two thirds of the 75% of the cases to construct the SE. This is to
simulate an SE that does not have long experience in the field and to examine
how the data can support constructing a good binary tree. The data available
to construct the tree (and to calculate the total DLs) was assumed to be one
third, two thirds and three thirds of the 75% of the cases. Three kinds of
binary trees are constructed by the method proposed in subsection 5.2, using
the SE and the respective data set, as long as the total DL continues to
decrease. These results are also compared with those of standard RDR trees
that are constructed using the same SE and the same three data sets. Note that
ten different knowledge bases are constructed for each default class of each
data set by changing the order of the training data at random. By taking the
average of these 10 runs we have 94 points in total for the whole 25
databases.

Table 2 summarizes the number of wins (ties included) of the proposed
method for the three training data sets of different sizes. Fig. 6 shows the
plots in the case of the one-third data set.

Table 2. Number of wins (ties included) for the training data of different sizes

training data size     No. of wins (out of 94 points)
one third of 75%       65 points
two thirds of 75%      74 points
three thirds of 75%    80 points




[Figure: scatter plot; horizontal axis: error rates for the standard RDR (0 to 0.7)]

Fig. 6. Comparison of the proposed method and the standard method



From these results, it is experimentally shown that the more data are
available, the more accurate the knowledge bases constructed by the proposed
method based on the MDLP are compared with those of the standard RDR, even
when the SE does not have enough expertise. Data can complement the lack of
expertise.

6.3 Experiment 3 to Compare the Effect of Using Data with C4.5

We consider constructing accurate knowledge bases from data alone without
relying on the SE. Using only the 75% of the cases and nothing else, we
construct the knowledge base by the method proposed in subsection 5.1 as long
as the total DL continues to decrease. Because a different ordering of
incoming cases and a different default class result in a different knowledge
base of RDR, we individually change these parameters and construct a total of
(the number of classes) × 10 knowledge bases. Then we select the knowledge
base with the smallest total DL. This knowledge base is expected to be the
best one that can be constructed by the proposed method.
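
The selection step can be sketched as a loop over default classes and random orderings. The callable `build` stands for the construction method of subsection 5.1 and is a hypothetical placeholder that returns a knowledge base together with its total DL; the function name `select_best` and the seed are also assumptions.

```python
import random


def select_best(classes, train, build, n_orderings=10, seed=0):
    """Build (number of classes) x n_orderings knowledge bases, keep the smallest DL."""
    rng = random.Random(seed)
    best = None
    for default_class in classes:
        for _ in range(n_orderings):
            order = train[:]
            rng.shuffle(order)                      # a new ordering of incoming cases
            kb, dl = build(order, default_class)    # hypothetical builder
            if best is None or dl < best[1]:
                best = (kb, dl)
    return best


# Stub builder for demonstration: the "knowledge base" is just a label and the
# total DL depends only on the default class.
def stub_build(order, default_class):
    return ("kb-" + default_class, {"Dog": 3.0, "Monkey": 1.0, "Penguin": 2.0}[default_class])


print(select_best(["Dog", "Monkey", "Penguin"], list(range(8)), stub_build))
```

Note that the choice is made on the total DL computed from the training data only, so the test data play no role in selecting the knowledge base.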

We compare the accuracy of this knowledge base with that of the induction
rule set obtained by C4.5 (this set is called “c4.5rules” below). The results
for the 25 databases are shown in Fig. 7, where the horizontal axis is the
error rates of c4.5rules and the vertical axis is the error rates of the
selected knowledge bases.

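[Figure: scatter plot; horizontal axis: error rates for c4.5rules (0 to 0.6);
vertical axis: error rates of the selected knowledge bases (0 to 0.6). Point
labels: “only root node” (Splice-junction), “only root node” (Connect-4),
(Tic-Tac-Toe)]

Fig. 7. Comparison of the proposed method with the c4.5rules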

There are 11 databases out of 25 for which the knowledge bases constructed
by the proposed method have equal or smaller error rates compared with those
of c4.5rules. However, there are a few databases where the proposed method
gives very high error rates. In particular, for those points noted as “only
root node” in Fig. 7, no new nodes are added to the starting root nodes and
the resultant knowledge bases consist of the root nodes alone. This may be due
to the way the starting point of the search is selected (see subsection 5.1).
It would have been possible to construct knowledge bases as accurate as
c4.5rules if changes were made to the starting conditions. Excluding these
points with “only root node”, the accuracy is about the same as that of
c4.5rules.

7 Conclusion 

We explored the possibility of integrating inductive learning (to extract
knowledge from data) into the Ripple Down Rules Method (a KA method that
relies on human experts), and proposed to use the MDLP as an underlying
principle.

It is experimentally shown that data can complement the lack of expertise,
i.e. the more data are available, the more accurate the knowledge bases
constructed by the proposed method are compared with those of the standard
RDR, even when a human expert does not have enough expertise. On the other
hand, it is also found that there are situations where the knowledge base does
not grow and the predictive accuracy is much worse than that of c4.5rules, but
this is rather rare. Overall, with our proposed method we can construct a
knowledge base which is equivalent to c4.5rules when the same amount of data
is allowed to be used. The datasets used in the experiments are all artificial
although taken from standard benchmark repositories. However, we expect the
proposed approach to succeed on real-world datasets, based on the known
success of the RDR approach.

We thus conclude that the proposed method makes it possible to make effective
use of both human expertise and accumulated data as separate sources of
knowledge, and to switch between manual acquisition by a human expert and
automatic induction from data at any point of knowledge acquisition.

Abstract. We introduce the construct of neighborhood dependency
(ND) to express regularities like: “Families with similar size and income
tend to own cars of similar size.” Arguably, the discovery of such
regularities is useful for prediction purposes. We have implemented and
tested an algorithm for mining NDs. The discovered NDs are then used
in the P-neighborhood method to predict unknown values.

Keywords: instance-based learning, k-nearest-neighbor, neighborhood
dependency, P-neighborhood.



1 Introductory Example 

Most data mining tasks involve predicting a target variable based on a number
of predictor variables. For example, predicting into what class a case falls
(classification), or predicting what number value a variable will have (regression). 
In instance-based learning, one starts from a number of existing cases with known 
values for target and predictor variables. Each new case is then compared with 
existing ones using a distance metric on the predictor variables, and the closest 
existing cases are used to compute the value for the target variable. Obviously, 
the distance metric to determine closeness is crucial in this approach, and should 
satisfy the following intuitive property: 

If two cases are close w.r.t. the predictor variables, then the two cases 

should have similar values for the target variable. 

Without this property, close cases could have quite dissimilar values for the target 
variable, and hence closeness would be meaningless for prediction purposes. To 
make this intuitive property rigorous, we introduce the construct of neighborhood 
dependency (ND), which is exemplified in the next paragraph. But first we note 
that database jargon will be used in the remainder of this paper; that is,
variables will be denoted as “attributes,” and cases as “tuples.” 

For the example, suppose a mini telephone poll of 23 families has resulted in 
a relation over Tel, FSize, Income, Monovol. The attribute FSize is the number 
of persons in the family. Monovol is the target attribute, and indicates whether 
the family owns a monovolume car. That is, the main question is: can one predict 
whether a new family would be interested in buying a monovolume car, given 
its telephone number, size, and income. Part of the dataset is shown in Fig. 1.

                    predictor attributes          target
POLL   Tel          FSize   Income    Monovol
       053/664842   6       250K      Y         (t1)
       016/260660   5       262.5K    Y         (t2)
       116/296597   3       150K      N         (t3)
       (23 tuples in total)

Fig. 1. Example relation.



We equip each attribute A with a closeness function $\theta_A$ which takes
as its input two attribute values, and outputs a number between 0 and 1. The
nearer the output is to 1, the closer the input values. For example, the
closeness function $\theta_{\mathit{FSize}}$ on family sizes yields 1 if the
two sizes are equal, 0.5 if they are one unit apart, and 0 otherwise. The
closeness function $\theta_{\mathit{Monovol}}$ for the attribute Monovol
yields 1 if the two arguments are equal (both Y or both N), and 0 otherwise.
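
The two closeness functions described above translate directly into code. The following sketch uses hypothetical function names; the behaviour follows the definitions in the text.

```python
def closeness_fsize(a, b):
    """1 if the family sizes are equal, 0.5 if one unit apart, 0 otherwise."""
    distance = abs(a - b)
    if distance == 0:
        return 1.0
    if distance == 1:
        return 0.5
    return 0.0


def closeness_monovol(a, b):
    """1 if both values are equal (both 'Y' or both 'N'), 0 otherwise."""
    return 1.0 if a == b else 0.0


# FSize values of t1 and t2 (6 and 5), and of t1 and t3 (6 and 3), in Fig. 1.
print(closeness_fsize(6, 5), closeness_fsize(6, 3))
print(closeness_monovol("Y", "Y"), closeness_monovol("Y", "N"))
```

Both functions are symmetric and return 1 for identical arguments, which is what one would expect of a closeness measure.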

A neighborhood predicate (NP) maps attributes to threshold numbers between 
0 and 1. An example is: 

$FSize^{0.5}\, Income^{0.8}$ , 

fixing a threshold of 0.5 for FSize, and 0.8 for Income. Two tuples $t_1$ and $t_2$ are 
neighbors under this NP if the closeness function $\theta_{FSize}$ applied on the FSize- 
values in both tuples, i.e., on $t_1(FSize)$ and $t_2(FSize)$, yields a number $\geq 0.5$ and 
$\theta_{Income}$ applied on the Income-values yields a number $\geq 0.8$. Fig. 2 visualizes this 
for two-dimensional space. Given a tuple t, the NP under consideration gives rise 
to a rectangle around t that distinguishes neighbors of t under $FSize^{0.5}\, Income^{0.8}$ 
from non-neighbors; the figure depicts the neighborhoods around $t_1$, $t_2$, and $t_3$. 
Note that such visualization is generally impossible, as attribute values may be 
nominal or the closeness function may not reflect Euclidean distance. 
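
The neighbor test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the closeness function for FSize follows the text, but the example does not define one for Income, so the linear form over an assumed 250K range is hypothetical, as are all function and variable names. The sketch also assumes that a closeness value equal to the threshold counts as close.

```python
def closeness_fsize(a, b):
    """1 if the sizes are equal, 0.5 if one unit apart, 0 otherwise (as in the text)."""
    gap = abs(a - b)
    return 1.0 if gap == 0 else 0.5 if gap == 1 else 0.0


def closeness_income(a, b, value_range=250_000):
    """Assumed linear closeness; the example does not specify this function."""
    return max(0.0, 1.0 - abs(a - b) / value_range)


CLOSENESS = {"FSize": closeness_fsize, "Income": closeness_income}


def are_neighbors(t, u, predicate, closeness=CLOSENESS):
    """True if t and u reach the threshold on every attribute of the NP."""
    return all(
        closeness[attr](t[attr], u[attr]) >= threshold
        for attr, threshold in predicate.items()
    )


# The NP with threshold 0.5 for FSize and 0.8 for Income.
np_example = {"FSize": 0.5, "Income": 0.8}

# The three tuples shown in the example relation.
t1 = {"FSize": 6, "Income": 250_000}
t2 = {"FSize": 5, "Income": 262_500}
t3 = {"FSize": 3, "Income": 150_000}

print(are_neighbors(t1, t2, np_example))
print(are_neighbors(t1, t3, np_example))
```

Keeping the thresholds in a dictionary keyed by attribute mirrors the definition of an NP as a map from attributes to threshold numbers; whether two particular tuples come out as neighbors depends entirely on the assumed Income closeness.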

An example neighborhood dependency (ND) is the expression: 

$FSize^{0.5}\, Income^{0.8} \rightarrow Monovol^{1.0}$ , 

expressing the hypothesis that neighbors under $FSize^{0.5}\, Income^{0.8}$ are also 
neighbors under $Monovol^{1.0}$. Put in simple words, families of similar size and income 
behave similarly w.r.t. the ownership of monovolume cars. The strength of this 
hypothesis is measured by the notion of confidence, i.e., the probability that 
two tuples that are neighbors under $FSize^{0.5}\, Income^{0.8}$ are also neighbors 
under $Monovol^{1.0}$. In Fig. 2, a filled circle represents a family with a 
monovolume car; among the seven families in the rectangle around $t_1$, five own 
a monovolume car, just like $t_1$, but two don’t. We say that the confidence of the 
above ND in $t_1$ is 5/7. The confidence in $t_3$ is 2/3, since 2 of the 3 neighbors 
of $t_3$ are like $t_3$: they don’t own a monovolume car. Another measure, called 
support, is introduced to reflect the number of neighbors of a tuple (7 in $t_1$, 5 
in $t_2$, and 3 in $t_3$). The foregoing confidence and support measures are defined 
relative to a given tuple in the data set. Additionally, weighted average support 
and confidence measures over all tuples are introduced, where the weight factor 
for a given tuple is proportional to its number of neighbors. 
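
A minimal Python sketch of these per-tuple measures follows. It rests on several assumptions: a tuple is not counted as its own neighbor, a closeness value equal to the threshold counts as close, and the weighted average confidence is taken as the support-weighted mean of the per-tuple confidences. The toy relation, the closeness functions, and all names are hypothetical and serve only to make the sketch runnable.

```python
def is_close(t, u, predicate, closeness):
    """True if t and u reach every threshold of the neighborhood predicate."""
    return all(closeness[a](t[a], u[a]) >= th for a, th in predicate.items())


def support_and_confidence(t, relation, lhs, rhs, closeness):
    """Support: number of lhs-neighbors of t. Confidence: share that are also rhs-neighbors."""
    left = [u for u in relation if u is not t and is_close(t, u, lhs, closeness)]
    agreeing = [u for u in left if is_close(t, u, rhs, closeness)]
    support = len(left)
    confidence = len(agreeing) / support if support else None
    return support, confidence


def weighted_confidence(relation, lhs, rhs, closeness):
    """Average confidence over all tuples, each weighted by its number of neighbors."""
    total_support, total_agreeing = 0, 0.0
    for t in relation:
        support, confidence = support_and_confidence(t, relation, lhs, rhs, closeness)
        if support:
            total_support += support
            total_agreeing += support * confidence
    return total_agreeing / total_support if total_support else None


# Hypothetical toy data: X is a numeric predictor, Y a categorical target.
toy_closeness = {
    "X": lambda a, b: 1.0 if abs(a - b) <= 1 else 0.0,
    "Y": lambda a, b: 1.0 if a == b else 0.0,
}
toy_relation = [
    {"X": 1, "Y": "Y"},
    {"X": 2, "Y": "Y"},
    {"X": 2, "Y": "N"},
    {"X": 9, "Y": "N"},
]
lhs, rhs = {"X": 1.0}, {"Y": 1.0}

print(support_and_confidence(toy_relation[0], toy_relation, lhs, rhs, toy_closeness))
print(weighted_confidence(toy_relation, lhs, rhs, toy_closeness))
```

Tuples without any neighbor contribute nothing to the weighted average here, which is one way to read a weight proportional to the number of neighbors.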



[Figure 2: scatter plot of the tuples with Income (100K to 300K) on the vertical 
axis and FSize on the horizontal axis.] 

Fig. 2. Filled circles represent tuples t for which t(Monovol) = Y; open circles 
represent tuples t for which t(Monovol) = N. 



One data mining task is as follows: given a target right-hand side NP, like 
$Monovol^{1.0}$, find a predictor left-hand side NP such that the resulting ND has 
sufficient support and confidence. This involves not only determining the 
attributes for the left-hand side, but also their associated threshold numbers (the 
width and the height of the rectangles in Fig. 2). If the thresholds are 
lowered, the neighborhood of each tuple, and hence its support, increases, but its 
confidence may decrease. 

Next, if a “good” left-hand side neighborhood predicate P is discovered, it is 
used for prediction purposes in what we call the P-neighborhood method. Clearly, 
an ND by itself does not tell us how to compute a target value when values 
for the predictor attributes are given. In this respect, it is different from, for 
example, decision trees and regression models, which provide effective procedures 
to compute a target value from predictor values. The proposed P-neighborhood 
method therefore relies on the existing tuples in the database, and hence can 
be categorized as instance-based learning. In particular, the P-neighborhood 
method predicts the target value of a new tuple based on all existing neighbors 
of the tuple under P. For example, if the ND $FSize^{0.5}\, Income^{0.8} \rightarrow Monovol^{1.0}$ 
turns out to have high support and confidence, then, given the values for FSize 
and Income of a new tuple t, the value t(Monovol) is predicted by taking 
into consideration all neighbors of t under $FSize^{0.5}\, Income^{0.8}$ that exist in the 
database. 

The important contribution of our work is the introduction of a simple yet 
semantically meaningful notion of “neighborhood” that can be used to make 
predictions in a way that can be easily understood. It is more intuitive than, for 
example, the factor k or the distance metric used in k-nearest-neighbor, where 
there is no preset definition of what constitutes a “good” distance metric or k. 



2 Experimental Results 



The problem of finding NDs that satisfy specified confidence and support 
thresholds is NP-hard if the number of attributes in the dataset is used as a 
complexity measure. An algorithm for mining NDs was implemented in ANSI C. 

We tested our methodology on the El Nino dataset [^, which contains oceano- 
graphic and surface meteorological readings taken from a series of buoys posi- 
tioned throughout the equatorial Pacific. We used 782 readings from 23 May 98 
to 5 June 98; the dataset contains 9 attributes. The training set consisted of 587 
(75%) randomly chosen tuples; the test set contained the remaining 195 tuples. 
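For each numerical attribute A, we applied the closeness function: 

$$\theta_A(x,y) = \begin{cases} 0 & \text{if } x \text{ or } y \text{ is missing} \\ 1 - \dfrac{|x-y|}{\max_A - \min_A} & \text{otherwise} \end{cases}$$

where $\max_A$ and $\min_A$ denote the maximal and minimal values for A found in 
the dataset under consideration. For each categorical attribute A, the closeness 
function was: 

$$\theta_A(x,y) = \begin{cases} 1 & \text{if both } x \text{ and } y \text{ are non-missing and } x = y \\ 0 & \text{otherwise} \end{cases}$$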
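
The two closeness functions just defined translate directly into code. The following Python sketch is illustrative only: it represents a missing value as `None` and assumes the attribute range is non-degenerate (maximum strictly larger than minimum); the function names are hypothetical.

```python
def numeric_closeness(x, y, min_a, max_a):
    """0 if either value is missing, else 1 - |x - y| / (max_A - min_A)."""
    if x is None or y is None:
        return 0.0
    return 1.0 - abs(x - y) / (max_a - min_a)


def categorical_closeness(x, y):
    """1 if both values are present and equal, 0 otherwise."""
    return 1.0 if x is not None and y is not None and x == y else 0.0


def attribute_range(values):
    """Minimal and maximal non-missing value of an attribute in the dataset."""
    present = [v for v in values if v is not None]
    return min(present), max(present)


air_temperatures = [20.0, None, 27.0, 28.0, 30.0]  # hypothetical readings
lo, hi = attribute_range(air_temperatures)
print(numeric_closeness(27.0, 28.0, lo, hi))
print(numeric_closeness(27.0, None, lo, hi))
print(categorical_closeness("buoy-7", "buoy-7"))
```

Because a missing value always yields closeness 0, a tuple with a missing predictor value can never be a neighbor under any positive threshold on that attribute.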



Note that these closeness functions account for the many missing values in the 
dataset. We discovered the “strong” ND: 

Buoy4f^'^ AirT'^'^ -A SeaSurfaceT^'^ , 



indicating that the buoy number and the air temperature together determine the 
sea surface temperature. It is interesting to add that the confidence gain w.r.t. 
the neighborhood dependency with the same right-hand side and an empty 
left-hand side was significant. 

The P-neighborhood method, with P the left-hand side of this ND, was then 
applied to predict SeaSurfaceT-values for each tuple in the test set. That is, for 
each tuple t in the test set, the value t(SeaSurfaceT) was predicted by averaging 
the SeaSurfaceT-values of all neighbors of t under P in the training set. 
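
The averaging step of the P-neighborhood method can be sketched as follows in Python. This is a hedged illustration, not the ANSI C implementation mentioned above: the thresholds, the attribute range, the toy readings, and all names are hypothetical, and the sketch assumes that training tuples with a missing target are skipped and that no prediction (`None`) is made when a test tuple has no neighbor.

```python
def predict_by_neighbors(new_tuple, training, predicate, closeness, target):
    """Average the target values of all training tuples that are P-neighbors."""
    values = [
        u[target]
        for u in training
        if u[target] is not None
        and all(closeness[a](new_tuple[a], u[a]) >= th for a, th in predicate.items())
    ]
    if not values:
        return None  # assumption: no neighbors, no prediction
    return sum(values) / len(values)


# Hypothetical closeness functions and data in the spirit of the buoy example.
closeness = {
    "Buoy": lambda x, y: 1.0 if x is not None and y is not None and x == y else 0.0,
    "AirT": lambda x, y: 0.0 if x is None or y is None else 1.0 - abs(x - y) / 10.0,
}
predicate = {"Buoy": 1.0, "AirT": 0.9}  # illustrative thresholds only
training = [
    {"Buoy": 1, "AirT": 27.0, "SeaSurfaceT": 28.0},
    {"Buoy": 1, "AirT": 27.2, "SeaSurfaceT": 28.4},
    {"Buoy": 2, "AirT": 27.0, "SeaSurfaceT": 25.0},
]
new_reading = {"Buoy": 1, "AirT": 27.1}

print(predict_by_neighbors(new_reading, training, predicate, closeness, "SeaSurfaceT"))
```

For a categorical target, the average would have to be replaced by another aggregate such as the most frequent value; that variant is not shown here.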



In order to assess the prediction quality, the distance between observed and 
predicted values was characterized by the Root Mean Square (RMS) error of 
the predictions converted to a percentage of the mean, called coefficient of RMS 
error (CRMS): 



$$\mathrm{CRMS} = 100 \times \frac{\sqrt{\sum (O - E)^2 / N}}{\sum O / N}$$

where O are the observed values in the test set, E are the expected values 
predicted by P-neighborhood, N is the cardinality of the test set, and the 
summation is over all tuples of the test set. The P-neighborhood method yielded a 
CRMS of 1%, indicating a high-quality prediction. 
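
As a small worked illustration, the CRMS can be computed as below. The sketch reads “percentage of the mean” as the mean of the observed values, which is an assumption; the function name and the sample numbers are hypothetical.

```python
import math


def crms(observed, predicted):
    """RMS error of the predictions as a percentage of the mean observed value."""
    n = len(observed)
    rms = math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, predicted)) / n)
    mean_observed = sum(observed) / n  # assumption: mean of the observed values
    return 100.0 * rms / mean_observed


# Two observations of 10 predicted as 9 and 11: RMS error 1, mean 10.
print(crms([10.0, 10.0], [9.0, 11.0]))
```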

For these data, a linear relationship between AirT and SeaSurfaceT is known. 
Predicting target values in the test set by a linear regression model yielded a 
CRMS of 1.9%, and hence was less precise than the P-neighborhood method. 
Next, in the same training set, the ND: 

Buoy^^'° — >■ Latitude^'^ 

had a confidence of 1.0. It expresses that the latitude of a buoy changes very 
little. Predicting Latitude-values in the test set by the P-neighborhood method, 
with P the left-hand side of this ND, yielded a CRMS of 2%. As Buoy# is in fact a 
non-numeric attribute, classical linear regression is not applicable here. 

Recently, we also tested our approach on the Meningoencephalitis Diagnosis 
dataset, donated by Dr. Shusaku Tsumoto (Shimane Medical University, Japan). 
The results are promising. 



3 Related Work 

NDs generalize functional dependencies (FDs) [H by comparing attribute values 
for similarity instead of equality. For example, the FD Buoy# → Latitude would 
express that the same buoy stays at the same latitude. As buoys move around 
to different locations, this FD does not hold for the El Nino dataset. Unlike 
NDs, FDs cannot express that the latitude of a buoy deviates only slightly from 
its average latitude. 

NDs are different from quantitative association rules (QARs) jS|, which as- 
sociate intervals to attributes. For example, 

AirT : 27..28 ⇒ SeaSurfaceT : 28..29 , 

stating that the sea surface temperature is between 28 and 29 if the air 
temperature is between 27 and 28. An ND, on the other hand, reveals an overall 
relationship between AirT and SeaSurfaceT. One can expect the existence of 
strong QARs in the presence of strong NDs (the opposite is not necessarily 
true). Significantly, a QAR expresses a relationship between values within 
tuples, whereas NDs also express relationships among tuples. 

One of the incentives of our work was to overcome certain semantic difficulties 
with roll-up dependencies (RUDs) jn|. In that work, attribute values are com- 
pared for equality after “rolling up” the values to a higher level of abstraction. 
For example, the cities Brussels and Mons are equal at the country level (both 
roll up to Belgium). The problem lies in the treatment of numeric attributes, 
where the roll-up is usually determined by fixed intervals. For the attribute 
Income, these intervals may be [0K..99K], [100K..199K], [200K..299K], ... Then 
100K and 199K are equal at the interval level, but 199K and 200K are not, and 
the roll-up will treat 199K as being closer to 100K than to 200K, which is 
counterintuitive. A similar problem has been raised for QARs. NDs avoid these 
problems by using a closeness function instead of intervals: 199K will be much 
closer to 200K than to 100K. 

The P-neighborhood method is inspired by, but different from, k-nearest-neighbor: 
to predict the value for a (numeric or categorical) target attribute A 
of a new tuple t, we rely upon all existing P-neighbors of t, rather than on the 
k “nearest neighbors” of t. The neighborhood predicate P itself is the result of a 
preceding data mining task: P results from the discovery of a rule $P \rightarrow A^{r}$ with 
high support and confidence, and r close to 1. 

4 Conclusion 

We proposed the P-neighborhood method for predicting a numeric or categorical 
target. It is an instance-based learning method that predicts the target of a new 
tuple based on its existing P-neighbors in the database. The construct of P- 
neighbor has natural semantics. The actual value for P is established during a 
preceding data mining task. Tests on real-life datasets show that the method is 
easy to apply (for example, it can treat numeric as well as categorical attributes; 
it can deal with missing values; default closeness functions can be used), and 
that the quality of the predictions is very promising. 



Abstract. In this paper, a new method, called EM-EA, is put forward for 
learning Bayesian network structures from incomplete data. This method 
combines the EM algorithm with an evolutionary algorithm (EA): it 
transforms the incomplete data into complete data using the EM algorithm and then 
evolves network structures using the evolutionary algorithm on the completed 
data. In order to learn Bayesian networks with hidden variables, a new mutation 
operator has been introduced and the function of the crossover operator has been 
correspondingly expanded. The results of the experiments show that EM-EA is 
more accurate and practical than other network structure learning algorithms 
that deal with incomplete data. 



1 Introduction 

Bayesian networks are applied more and more widely, and have become a main 
method for dealing with uncertainty in the field of artificial intelligence. In particular, 
in recent years there has been growing interest in learning Bayesian networks from 
data. At present, there are effective methods for structure learning and 
parameter learning from complete data, and good methods for parameter learning from 
incomplete data under a fixed network structure. However, there are few effective and 
efficient methods for learning network structures from incomplete data. Furthermore, 
learning network structures with hidden variables is an especially difficult problem. 

In 1998, Friedman put forward the structural expectation-maximization algorithm, 
which he named MS-EM. His method employs the EM algorithm and a greedy search 
algorithm. But when the search space is very large and has a multimodal 
landscape, the greedy search algorithm will stop at a locally optimal model. 

In 1996, Larranaga et al. discussed learning network structures using an 
evolutionary algorithm. The results of their experiments show that, with complete 
data, their method can learn good network structures and avoid getting trapped in 
local maxima. But for incomplete data, the results are not ideal. 

* This research has been supported by Natural Science Foundation of China, National 973 
Fundamental Research Program and 985 Program of Tsinghua University. 

In 1999, W. Myers et al. improved Larranaga’s work to adapt it to 
incomplete data. Their method evolved not only the network structures but also the 
missing data, completing the incomplete data by means of genetic operations. 

However, their method suffers from an efficiency problem due to the enlarged search 
space, and from a convergence problem caused by the strong randomness of the genetic 
operators for the missing data. 

In this paper, we present a new method called EM-EA. Compared to previous 
work, our method makes two improvements: (1) it combines the EM algorithm with 
an evolutionary algorithm, which effectively addresses both the problem of learning 
network structures from incomplete data and the problem of getting trapped in local 
maxima; (2) it extends the EA of W. Myers et al. to learn Bayesian networks with 
hidden variables. 



2 Evolutionary Algorithm 

The very large, multi-dimensional, multi-modal landscape immediately suggests the 
use of evolutionary algorithms. A Bayesian network can be broken down into local 
structures (a variable and all its parents) that can be considered genes. Then the 
whole network structure can be represented as a chromosome. Furthermore, we can 
use the MDL score as the fitness function for evaluating the network structure. 

Formally, the structure, S, can be represented as an adjacency list, see Figure 1, 
where each row represents a variable $v_i$ and the parents of $v_i$, $\pi_{v_i}$. The 
adjacency list can be thought of as a chromosome, where each row is a gene and the 
$\pi_{v_i}$ are the alleles. This representation is convenient because the log form of 
MDL is the summation of scores for each variable. Because of this, each gene can be 
scored separately and the scores added to generate the fitness score for the entire 
structure. Of course, this assumes that the MDL score has a closed form, which is 
the case for complete data. 
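
The adjacency-list chromosome and its gene-wise scoring can be sketched in Python as follows. The sketch only illustrates the decomposition: the local score used here is a hypothetical stand-in (it merely penalizes the number of parents) and is not the MDL score; all names are assumptions.

```python
def fitness(structure, local_score):
    """Sum of the per-gene scores; structure maps each variable to its parents."""
    return sum(local_score(var, parents) for var, parents in structure.items())


def rescore_after_change(total, var, old_parents, new_parents, local_score):
    """Update the total when a single gene changes, without rescoring the rest."""
    return total - local_score(var, old_parents) + local_score(var, new_parents)


def toy_local_score(var, parents):
    """Hypothetical stand-in for a decomposable score: one penalty unit per parent."""
    return -float(len(parents))


# A chromosome: each entry is a gene (variable -> tuple of parent variables).
chromosome = {"A": (), "B": ("A",), "C": ("A", "B")}

total = fitness(chromosome, toy_local_score)
print(total)

# Mutating one gene (removing the arc B -> C) only requires rescoring that gene.
print(rescore_after_change(total, "C", ("A", "B"), ("A",), toy_local_score))
```

Scoring genes separately means that a mutation touching one gene needs only one local score to be recomputed, which is what makes this representation convenient for an evolutionary search.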

However, there is no closed-form expression for evaluating structures when the 
data are incomplete. In [9], W. Myers et al. turn the incomplete data problem 
into a complete data problem by evolving the missing data and imputing these values 
into the data. So, they evolve not only the network structures but also the missing 
values. They represent each cell of the dataset that has a missing value as a gene. 
The gene takes on sampled values from the set of values of the corresponding 
variable. The chromosome is a string of missing values. 

For the missing data chromosomes, W. Myers et al. chose uniform parameterized 
crossover. As for the mutation operator, they randomly select a value from the 
remaining possible values of the corresponding variable. They also use uniform 
parameterized crossover for the structure chromosome. They employed three basic 
mutation operators for the network structure chromosome. Two of them are adding 
a node to a gene and deleting a node from a gene. In the phenotype, these have the 
effect of adding and deleting arcs, respectively. The third one is reversing an arc, 
which is implemented in the phenotype by deleting the parent-child arc and adding 
the child-parent arc. 



3 EM-EA Algorithm 

The EA of W. Myers et al. can avoid getting trapped in local maxima, but it also has 
two disadvantages. One is that it exponentially enlarges the search space (missing 
data × network structures). When the number of missing values is large, the search 
space is so large that the efficiency of the algorithm will be very low and it is difficult 
to get satisfactory results. More importantly, the completion of the incomplete 
data achieved by the genetic operators has strong randomness and 
cannot reflect the probability distribution that the missing data actually follow. So, it 
is difficult for their method to assure convergence. 

To address the disadvantages of the EA of W. Myers et al., we combine the EM 
algorithm with evolutionary algorithms: we handle the incomplete data with the 
EM algorithm, and learn Bayesian network structures with evolutionary algorithms. 
In addition, in order to enable our method to learn network structures with 
hidden variables, we improve the EA of W. Myers by introducing a new mutation 
operator and expanding the function of the crossover operator. 

The mutation operator that we introduce can add new vertices and arcs to 
the network and delete arcs from the network. However, we cannot add vertices 
and arcs arbitrarily; we must follow some criterion when employing this operator. 
Our criterion is: when some vertices depend on each other and are densely connected, 
we add a vertex representing a hidden variable to the network. The parents of the 
added vertex are the common parent nodes of those interdependent vertices, 
while the parent nodes of those vertices are replaced by the 
added vertex. The interdependence among those vertices is thus represented 
by a hidden variable, which simplifies the network structure. The experiments show 
that when there are three variables whose parent sets have a common subset, this 
mutation operator can be used for evolutionary calculation. 




Fig. 2. (a) An example of a network; (b) the simplest network that can capture the 
same distribution without using the hidden variable 

A simple example, originally given by Binder et al., is shown in Figure 2. In 
Figure 2, the network structure (b) can be evolved to network structure (a) using our 
mutation operator. The corresponding adjacency lists for the network structures in 
Figure 2 are shown in Figure 3. The concrete process is as follows: by analyzing the 
adjacency list in Figure 3b, we find that vertices A, B and C appear together most 
frequently in the alleles, and the vertices whose alleles include A, B and 
C are D, E and F. Therefore, we add to the adjacency list a new gene corresponding to 
a hidden variable H whose allele is ABC, and replace the alleles of D, E and F with 
H. Thus the adjacency list shown in Figure 3a is formed, whose corresponding 
network structure is (a) in Figure 2. 
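
The mutation just described can be sketched in Python. This is a simplified, hypothetical rendering rather than the authors' operator: it searches for a group of three variables whose parent sets share the largest common subset (at least two parents, an assumed cut-off), adds a hidden gene with that subset as its allele, and replaces the alleles of the three variables with the hidden variable. The example structure and all names are assumptions made for illustration.

```python
from itertools import combinations


def add_hidden_variable(structure, hidden="H", group_size=3, min_common=2):
    """Return a mutated copy in which a densely connected group shares a hidden parent."""
    best = None
    for group in combinations(sorted(structure), group_size):
        common = frozenset.intersection(*(frozenset(structure[v]) for v in group))
        if len(common) >= min_common and (best is None or len(common) > len(best[1])):
            best = (group, common)
    mutated = {v: frozenset(parents) for v, parents in structure.items()}
    if best is None:
        return mutated  # criterion not met: leave the structure unchanged
    group, common = best
    mutated[hidden] = common            # new gene for the hidden variable
    for v in group:
        mutated[v] = frozenset({hidden})  # alleles replaced by the hidden variable
    return mutated


# Illustrative structure without a hidden variable (child -> parents).
without_hidden = {
    "A": set(), "B": set(), "C": set(),
    "D": {"A", "B", "C"},
    "E": {"A", "B", "C", "D"},
    "F": {"A", "B", "C", "D"},
}

with_hidden = add_hidden_variable(without_hidden)
for var in sorted(with_hidden):
    print(var, "|", "".join(sorted(with_hidden[var])))
```

Enumerating all triples is exponential in the group size and is used here only for clarity; a practical operator would need a cheaper way to detect frequently co-occurring alleles.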

But the introduction of our mutation operator also raises a new problem. Adding 
a gene changes the length of the chromosome after this new mutation operator is 
applied, which makes it difficult to apply the crossover operator. So we have to 
expand the crossover operation. Concretely, we add virtual genes corresponding to 
the hidden variables to the shorter chromosome, so that the two chromosomes taking 
part in the crossover operation have the same length. A virtual gene is one whose 
allele is empty and whose corresponding variable does not appear in the alleles of 
other genes. In fact, adding a virtual gene is equivalent to adding an isolated vertex 
to the network. After adding virtual genes, the two chromosomes can undergo the 
usual crossover operation. 
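
The padding with virtual genes can be sketched as follows in Python. The gene-wise uniform crossover shown after the padding is only one possible reading of “the usual crossover operation”; the swap probability, the example chromosomes, and all names are assumptions.

```python
import random


def pad_with_virtual_genes(chrom_a, chrom_b):
    """Give both chromosomes the same variables; a missing gene gets an empty allele."""
    variables = sorted(set(chrom_a) | set(chrom_b))

    def pad(chrom):
        return {v: frozenset(chrom.get(v, ())) for v in variables}

    return pad(chrom_a), pad(chrom_b)


def uniform_crossover(chrom_a, chrom_b, swap_prob=0.5, rng=None):
    """Pad the chromosomes, then swap each pair of genes with probability swap_prob."""
    rng = rng or random.Random()
    child_a, child_b = pad_with_virtual_genes(chrom_a, chrom_b)
    for v in child_a:
        if rng.random() < swap_prob:
            child_a[v], child_b[v] = child_b[v], child_a[v]
    return child_a, child_b


# Hypothetical parents: one contains the hidden variable H, the other does not.
parent_with_h = {"A": (), "B": (), "D": ("H",), "H": ("A", "B")}
parent_without_h = {"A": (), "B": (), "D": ("A", "B")}

padded_a, padded_b = pad_with_virtual_genes(parent_with_h, parent_without_h)
print(padded_b["H"])  # the virtual gene: an empty allele

child_a, child_b = uniform_crossover(parent_with_h, parent_without_h, rng=random.Random(0))
print(sorted(child_a), sorted(child_b))
```

A child may inherit a gene that refers to H together with the virtual (empty) gene for H; this is still a valid structure, since the virtual gene simply makes H a root vertex.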

With the evolutionary algorithm expanded as described above, the EM-EA algorithm 
can find and evolve network structures with hidden variables. The whole process of 
the EM-EA method is as follows: 



(a)          (b)
A            A
B            B
C            C
D | H        D | ABC
E | H        E | ABCD
F | H        F | ABCD
H | ABC

Fig. 3. The adjacency lists corresponding to the networks in Figure 2 



(1) Complete the incomplete dataset D using the current network and the 
EM algorithm, and get the complete dataset. 

(2) Apply crossover or mutation operations to the original group, 
and get the evolved group. 

(3) For each network S in the evolved group, do as follows: 

a) Examine whether network S is a directed acyclic graph. If it is, then 
calculate the fitness according to the MDL score function; 
otherwise assign network S a small fitness value. 

b) Calculate the selection probability of network S according to 
the fitness of S. 

(4) Choose the X individuals having the highest selection probabilities 
to form the next generation, where X represents the size of the 
evolutionary group. 

(5) Select the network with the highest fitness. If its fitness exceeds that of the 
current network, it becomes the current network. 

(6) Check whether the termination condition of the algorithm is satisfied. If it is 
satisfied, then quit; otherwise, go to (1) and continue the above process. 
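
Step (3a) of the process above can be sketched in Python. The acyclicity test repeatedly removes vertices that have no remaining parents; the scoring function passed in is a hypothetical placeholder for the MDL-based fitness, and the penalty value for cyclic structures is an assumption (the text only says that such networks receive a small value).

```python
def is_dag(structure):
    """True if the structure (variable -> parents) contains no directed cycle."""
    remaining = {v: set(parents) for v, parents in structure.items()}
    for parents in list(remaining.values()):
        for p in parents:
            remaining.setdefault(p, set())  # parents without their own gene are roots
    while remaining:
        roots = [v for v, ps in remaining.items() if not any(p in remaining for p in ps)]
        if not roots:
            return False  # every remaining vertex still waits for a parent: a cycle
        for v in roots:
            del remaining[v]
    return True


def guarded_fitness(structure, score, penalty=-1e9):
    """Score acyclic structures; give cyclic ones a small (penalty) fitness."""
    return score(structure) if is_dag(structure) else penalty


def placeholder_score(structure):
    """Hypothetical stand-in for the MDL-based fitness: fewer arcs score higher."""
    return -float(sum(len(parents) for parents in structure.values()))


acyclic = {"A": (), "B": ("A",), "C": ("A", "B")}
cyclic = {"A": ("C",), "B": ("A",), "C": ("B",)}

print(is_dag(acyclic), guarded_fitness(acyclic, placeholder_score))
print(is_dag(cyclic), guarded_fitness(cyclic, placeholder_score))
```

The check is needed because crossover and arc-reversal mutations operate on genes independently and can therefore produce structures that contain cycles.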



4 Experiment Results 




In order to validate our method, we compare the EM-EA algorithm with the EA of W. 
Myers et al. and with Friedman’s MS-EM algorithm, respectively. 

When comparing our method with the EA of W. Myers et al., for convenience we also 
use the Bayesian network known as ASIA, which has been used by W. Myers. 
Furthermore, we use the same experimental process as that of W. Myers. In addition, 
the original evolutionary group is obtained with the following method: 
from the 1000 samples that include missing data mentioned above, we use a computer 
program to create a supposed complete dataset, and choose some of the better network 
structures as the original group based on this complete dataset. The current network 
can be selected at random from the original group. In the experiments, the mutation 
and crossover probabilities are set to 0.05 and 0.5, respectively; the group size is set 
to 40. 

Figure 4 shows the log loss score of the two algorithms for each level of missing 
data. As can be seen from the figure, both EA and EM-EA could find good predictive 
networks at 0%, 5%, and 15% missing data. At 30%, the predictive accuracy of 
the two algorithms degrades sharply. However, the performance of EM-EA is 
better than that of the EA of W. Myers, especially at 30% missing data. 

Fig. 4. The comparison of the two algorithms in terms of Log Loss 

When comparing our method with Friedman’s MS-EM, we use the same 
experimental conditions as those of MS-EM. The stopping criterion for the algorithm 
is set at 1000 generations. Except that the mutation rate is set to 0.1 and the group size 
is set to 50, the selection of the original group, the current network, and the crossover 
probability are the same as in the above experiment. We tested the 
average log-loss of our algorithm on a separate test set. The results are 
summarized in Figure 5. 

From Figure 5, we can see that the performance of EM-EA is comparable with that of 
MS-EM for the large samples, but the former is more robust as the number of 
hidden variables varies. For the middle-sized samples, EM-EA works better than 
MS-EM, especially when the number of hidden variables increases. When the 
sample size is small, however, EM-EA performs worse than MS-EM. A possible 
reason is that evolutionary algorithms need more samples than greedy search 
algorithms to some extent. 

Fig. 5. The comparison of the two algorithms in terms of learning 
performance of network structures with hidden variables 



5 Conclusion and Future Work 

The results of the experiments verified the validity of our method. Compared with the 
EA of W. Myers et al., our algorithm is more accurate and efficient. In terms of 
learning network structures with hidden variables, our algorithm is comparable with 
MS-EM. However, MS-EM starts with a given set of hidden variables and attempts to 
find a model that includes them, whereas our algorithm can create hidden variables on 
an as-needed basis during the learning process, and so is more flexible and practical 
than MS-EM. 

Next, we will test our method further with the BDe score. In addition, we are 
planning to explore applications of this method. In particular, we will try to apply 
it to macroeconomic prediction. 



Abstract. We introduce an interactive decision tree construction system, 
DTViz, which consists of five components and maintains two interaction 
windows, and attempts to integrate the user’s preference and 
domain knowledge into the construction process. 



1 Introduction 

There are several visualization systems for constructing decision trees. Most of 
them visualize the decision tree structure only [6, 7]. The PBC system developed 
in [1, 2] is very close to our work and attempts to interactively construct 
decision trees. However, it uses the circle segments technique to visualize the 
original data, so some visualization space is wasted and the number of tuples in 
the training data is limited. Secondly, the decision tree being constructed is not 
visually displayed, so the user is not able to clearly see what is going on and 
make decisions at critical moments, though [1] improves on this. Finally, the PBC 
system doesn’t provide any approaches for cleaning the raw data. 

We present a novel approach and develop a visualization system, DTViz, 
for interactively building decision trees based on our RuleViz model [3, 4] by 
visualizing the entire process. DTViz is a fully interactive system. During the 
decision tree construction, the user can integrate domain knowledge, see the 
intermediate decision trees, evaluate tree nodes, and feed back his/her perception. 

2 The DTViz System 

The DTViz system consists of the following components: data tuple visualization, 
data reduction, decision tree node construction, node evaluation, and decision 
tree visualization, shown in Fig. 1. 

DTViz maintains two interaction windows, the data window and the tree window. 
The training data are visualized in the data window in order for the user to 
see the data distribution vertically and horizontally. The data window is often 
limited and cannot accommodate a large number of tuples. If scrollable windows 
are used, the user is not capable of observing the entire data set at a 
glance and thus the user’s attention is distracted. Thus, large data sets must be 
reduced. The selected attributes and randomly sampled tuples (if necessary) are 
used as the training examples for constructing decision trees. 

D. Cheung, G.J. Williams, and Q. Li (Eds.): PAKDD 2001, LNAI 2035, pp. 575-580, 2001. 

Fig. 1. The DTViz System 



Decision trees are interactively constructed by iteratively performing the last 
three components and switching between the two windows. The data window 
visualizes the data tuples contained in the current node. Initially, it visualizes 
the entire data set corresponding to the decision tree root. The construction of 
decision trees is an interactive process with feedback from the user. 

The current decision tree is visualized in a separate window, and is drawn 
from left to right with the root at the middle of left side. The nodes in the same 
level of the tree are uniformly arranged. 

3 Data Visualization and Reduction 

We combine two techniques, circle segments [2] and parallel coordinates [3], to 
develop a new visualization technique, called parallel segments. Fig. 1 illustrates 
this technique, where the left-top snapshot demonstrates the visualization of a 
six-dimensional data set. 

The visualization area is divided into d equal-sized segments for a d-dimensional 
data set, with each segment corresponding to an attribute. Within each segment, 
the pixels are arranged to start from the left bottom and end at the right top in 
a row-by-row and bottom-up fashion. 

To visualize the training tuples, we map the attribute values occurring in 
the data set to the pixels in the attribute columns. The attribute values are 
sorted by each attribute and mapped to the pixels in the arrangement order. 
The pixels are rendered in the color determined by the class label of the tuples to 
which the attribute values belong. The color scale for class labels used in DTViz is 
derived from the PBC system [2], which is based on the HSI color model. 
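
The row-by-row, bottom-up arrangement within a segment can be sketched as follows in Python. The sketch assumes that the segments are vertical strips placed side by side and that screen coordinates have their origin at the top left with y growing downward; neither detail is stated in the text, and the function name and parameters are hypothetical.

```python
def pixel_position(rank, segment, segment_width, window_height):
    """Map the rank of a sorted attribute value to an (x, y) pixel in its segment.

    Ranks fill a segment from its bottom-left pixel to its top-right pixel,
    one row at a time.
    """
    row, col = divmod(rank, segment_width)
    if row >= window_height:
        raise ValueError("the segment cannot hold this many values")
    x = segment * segment_width + col  # assumed layout: segments side by side
    y = window_height - 1 - row        # assumed screen origin: top left
    return x, y


# A window three pixels high with segments four pixels wide.
print(pixel_position(0, 0, 4, 3))   # first value of the first attribute
print(pixel_position(5, 1, 4, 3))   # sixth value of the second attribute
print(pixel_position(11, 0, 4, 3))  # last pixel of the first segment
```

With this layout a segment holds at most segment_width × window_height values, which illustrates why the data window limits the number of tuples that can be shown.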

The size of data that can be visualized in parallel segments is determined 
by the data window size and the number of attributes. For large data sets, it 
is necessary to select important attributes with respect to the class labels and 
sample the typical tuples. To visualize the data values in parallel segments, 
the tuples are sorted by each attribute. DTViz uses the quicksort algorithm. 
With k attributes and n tuples, the sorting will run in $O(kn \log n)$. The 
sorting must be redone whenever the data set is updated as the decision tree 
construction proceeds. For large k and n, DTViz randomly samples the original 
data set. 

DTViz also provides an interactive feature selection mechanism, in which the 
user can arbitrarily delete any attributes that he/she considers irrelevant or not 
strongly related to the class attribute. How many attributes and which attributes 
should be deleted or retained can be determined based on the user’s perception 
and domain knowledge. 

4 Decision Tree Visualization 

The intermediate trees and the final decision tree are visualized in the tree 
window. The root is at the middle of the first column. The children of the root are 
evenly distributed in the second column. Generally, all the tree nodes at the i-th 
level are evenly arranged in the i-th column (i ≥ 1), as shown in Fig. 3. 

There are two types of tree nodes, labeled and unlabeled. Labeled nodes 
represent the leaf nodes of the final tree that cannot be split further and are labeled 
with the most frequent class labels with respect to the nodes. The labeled nodes 
are drawn as rounded rectangles, and rendered in terms of the node evaluation 
and the HSI color model. Assume that the prediction accuracy and support 
of a labeled node are a and s, respectively. Then the node color is calculated as 
follows: hue = 0.5 + 2 × a, intensity = 0.5 + 2 × s/n, saturation = 1.0, where n is the 
size of the test data set. Moreover, the class label and the (support, accuracy) 
pair are displayed in the labeled node with rounded rectangles. The unlabeled 
nodes are drawn as rectangles, which are filled with the split attributes. If an 
unlabeled node is under construction, then it is left blank and ready to be split. 

5 Interactive Construction of Decision trees 

The decision tree is constructed from the training set. DTViz provides five in- 
teraction operations for the user to interactively build decision trees based on 
his/her perception and node evaluation. Fig. 2 depicts the interaction model 
developed in DTViz. 

Initially, the decision tree contains only the root node, which covers the entire 
training set. The tree window displays the blank root. At this moment, the root 
is the current node for split. 

As the decision tree construction proceeds, the user can arbitrarily select 
interaction operations to view the state of tree nodes and to control the growth 
and shrinking of the decision tree. The left four operations in Fig. 2 change the 
current decision tree. The data window may need to be updated to reflect the current 
tree node. The final decision tree is obtained when all leaf nodes are labeled. 

Fig. 2. The interaction model of constructing decision trees used in DTViz 



Node Split 

The node split includes three steps: selecting a split node, selecting a split
attribute, and selecting split points of the attribute. The first step is
performed in the tree window. Labeled nodes cannot be selected for splitting.

The second and third steps are performed in the data window. When a node 
is specified to be split, the data window only visualizes the data tuples covered 
by this node, and the segments corresponding to the attributes that appear in 
the nodes on the path from the root to the current node are left blank. 

The split attribute can be interactively specified by the user. Attribute
selection follows two strategies: (1) the clearer the clusters in a parallel
segment, the better the corresponding attribute for splitting; and (2) the
closer in size the clusters and non-clusters are, the better the corresponding
attribute.

To select a split point for the split attribute, one just needs to click on
the pixel that separates clusters. Note that the separation of two different
colors is not the only criterion for determining the exact split point, because
the same attribute values may belong to tuples of different classes. To solve
this problem and help the user identify reasonable split points, the DTViz
system provides feedback to the user in the following ways: (1) the attribute
value of the pixel at the position of the mouse pointer is displayed in the
status area at the bottom line of the data window while the mouse is moving;
and (2) the scroll bar provides another means of viewing attribute values and
class labels. First, point the mouse around the boundary of two differently
colored clusters to get a rough range in which the possible split point lies,
which works because the attribute values are sorted. Then slightly move the
slider around the range to see if the boundary is actually a split point.

The split points are displayed above the scroll bar. DTViz allows the user to
split an attribute into several intervals at once.

Additionally, following the splitting strategy discussed in [2], one can parti- 
tion the coherent regions of values in the split attribute column. 

Node Labeling/Unlabeling 

The labeled node is drawn as a rounded rectangle and rendered with the
color calculated by the method discussed before. The class label that occurs
most frequently in the data set covered by this node is taken as the node label
and displayed in the node with the node classification accuracy and coverage.

The node to be labeled must be a leaf node of the current tree. To make sure
that the labeled node has high accuracy, so that the decision tree is
optimized, one can first evaluate the node to see its classification accuracy
and coverage before labeling it.

If one changes his/her mind, the labeled node can be unlabeled. The unla- 
beled node is restored to a leaf node, which can be split again. 

Node Evaluation 

At any time during the construction of a decision tree, one can evaluate any
node in the current decision tree. The node evaluation includes finding the
node class label, which is the label that occurs most frequently in the node
tuples; the node support, which is the number of node tuples; and the node
classification accuracy, which is calculated as the relative frequency of the
node label in the set of tuples covered by the node.
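
The three quantities above can be computed in a few lines. The following is a
minimal sketch, not the DTViz implementation; the function name
`evaluate_node` and the representation of a node as a plain list of class
labels are assumptions made for illustration.

```python
from collections import Counter


def evaluate_node(class_labels):
    """Return (label, support, accuracy) for the tuples covered by a node.

    class_labels: class label of every tuple covered by the node
                  (hypothetical representation, chosen for this sketch).
    """
    if not class_labels:
        raise ValueError("a node must cover at least one tuple")
    counts = Counter(class_labels)
    # Node label: the class that occurs most frequently among the tuples.
    label, hits = counts.most_common(1)[0]
    # Node support: the number of tuples covered by the node.
    support = len(class_labels)
    # Node accuracy: relative frequency of the node label.
    return label, support, hits / support


label, support, accuracy = evaluate_node(["<=50K", "<=50K", ">50K", "<=50K"])
print(label, support, accuracy)
```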

Decision Tree Pruning 

Decision tree pruning is needed in the following cases: (1) the user is not
satisfied with the structure of the current decision tree; (2) the user finds
it hard to split some unlabeled leaf nodes in a way that gives a high node
evaluation; or (3) the final tree is too large. Note that only non-leaf nodes
can be pruned, and that the pruned node itself is not removed, while all its
descendants are. The pruned node can be re-split or simply labeled to become a
final leaf node.

6 DTViz Implementation and Experiment 

We implemented the approach described in this chapter with Visual J++ 6.0 on
Windows 98. We tested the DTViz system with data sets from the UCI
repository [5], including Adult, Iris, Car, Flag, Breast-Cancer, etc. Fig. 3
illustrates the decision tree construction with the Adult data set.

The Adult database consists of 14 condition attributes, 6 continuous and 8
nominal. The class attribute Salary has two values, >50K and <=50K. Due to the
data window size, 5000 instances are randomly sampled and visualized. In
Fig. 3, the left window shows a decision tree under construction. Four leaf
nodes are labeled, and four nodes are unlabeled, three of which are split and
one of which is to be labeled or split, depending on the node evaluation and
the user's decision. The right window visualizes the data set contained in a
tree node in the third level, since two attributes have already been split.

Fig. 3. The tree window and the data window



7 Conclusion 

We presented a visualization system, DTViz, for interactively constructing
decision trees. DTViz consists of five components and two interaction windows.
To visualize the training data, a pixel-oriented visualization technique,
parallel segments, is developed. The strategies for selecting split attributes
and split points are discussed. The five interactive operations help the user
grow, prune, and revise the decision tree iteratively until the final result is
satisfactory. The characteristics of DTViz include ease of use, uncertain
results, varying accuracy, an understandable decision tree structure, and
on-demand node and attribute discretization.

Abstract. Data mining applications require learning algorithms to have
high predictive accuracy, scale up to large datasets, and produce
comprehensible outcomes. The Naive Bayes classifier has received extensive
attention due to its efficiency, reasonable predictive accuracy, and
simplicity. However, the Naive Bayes assumption that attributes are
independent given the class is often violated, producing incorrect
probabilities that can affect the success of data mining applications. We
extend the Naive Bayes classifier to allow certain dependency relations among
attributes. Compared to previous extensions of Naive Bayes, our algorithm is
more efficient (more so in problems with a large number of attributes) and
produces simpler dependency relations for better comprehensibility, while
maintaining very similar predictive accuracy.



1 Introduction 

Learning a Bayesian classifier is the process of constructing a special
Bayesian network from a given set of pre-classified examples, each of which is
represented by a vector of attribute values. Assume $A_1, A_2, \ldots, A_n$ are
n attributes which take values $a_1, a_2, \ldots, a_n$, respectively. These
attributes are used collectively to predict the value c of another attribute C,
called the class label.

According to the Bayesian rule, the probability of an example E being in 
class c is: 



$$p(C = c \mid a_1, a_2, \ldots, a_n) = \frac{p(a_1, a_2, \ldots, a_n \mid C = c)\, p(C = c)}{p(a_1, a_2, \ldots, a_n)}$$



The predicted class is the value of C with the largest probability.
Assume all attributes are independent given the class. That is:

$$p(a_1, a_2, \ldots, a_n \mid c) = p(a_1 \mid c)\, p(a_2 \mid c) \cdots p(a_n \mid c)$$



The resulting Bayesian classifier is called the Naive Bayesian classifier. 
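
The two formulas above translate directly into a counting procedure. The
sketch below is one minimal way to write it, assuming nominal attribute values
and plain relative-frequency estimates without smoothing; the function names
and the toy data are hypothetical and not taken from the paper. The
denominator $p(a_1, \ldots, a_n)$ is the same for every class, so it is
dropped when comparing classes.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(i, c)][v]: number of class-c examples whose i-th value is v
    value_counts = defaultdict(Counter)
    for attrs, c in zip(examples, labels):
        for i, v in enumerate(attrs):
            value_counts[(i, c)][v] += 1
    return class_counts, value_counts


def classify(model, attrs):
    """Return the class c maximizing p(c) * prod_i p(a_i | c)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total                            # p(C = c)
        for i, v in enumerate(attrs):
            score *= value_counts[(i, c)][v] / n_c     # p(a_i | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


examples = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(examples, labels)
print(classify(model, ("rain", "cool")))
```

Because no smoothing is applied, an attribute value never seen with a class
gives that class probability zero; a practical implementation would usually
add a smoothing term.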

Where strong dependencies do exist among attributes (such as $A_i = A_j$), the
probability computed by Naive Bayes will not be correct. Friedman (1997)
presented work on learning tree-like Bayesian networks, TAN (Tree-Augmented
Naive Bayes), in which non-class attributes can form a tree structure.
A TAN is a compromise between a full Bayesian network and Naive Bayes.

The basic idea is to approximate the underlying probability distribution by
conditional mutual information. Its time complexity is $O(n^2)$, where n is the
number of attributes.¹ Keogh and Pazzani (1999) present another approach
to learning TANs, which searches heuristically for a TAN guided by predictive
accuracy. Their algorithm, called SuperParent, also has time complexity
$O(n^2)$. They show that their algorithm consistently predicts more accurately
than Friedman's (1997) TAN, which in turn predicts more accurately than
Naive Bayes. SuperParent consists of two major steps. The first step searches
for the best super parent, the one that improves predictive accuracy the most.
A super parent is a node with arcs pointing to all other nodes without a parent
(not counting the class label). The second step determines the best child for
the super parent chosen in the first step, again based on predictive accuracy.
After one iteration of the two steps, one arc is added to the TAN, and the
process repeats until no improvement is achieved or n − 1 arcs have been added
to the tree. SuperParent is thus a greedy algorithm with complexity $O(n^2)$.
Our algorithm, called StumpNetwork, is faster than SuperParent by a constant
factor, while maintaining very similar predictive accuracy. The constant
becomes larger in domains with a large number of attributes. This speed-up is
important for data mining problems with a large number of attributes.



2 An Improved Algorithm for Learning TANs 

We extend Keogh and Pazzani (1999) by proposing a more efficient algorithm,
called StumpNetwork, to construct a special class of TAN. The motivation for
StumpNetwork derives from the observation that the dependencies among
attributes tend to cluster into groups in many real-world domains with a large
number of attributes. Attributes in each cluster form a simple dependency
relation: a one-level tree structure, called a tree stump.² For example, in
the customer database of a commercial bank, the amount of deposit, credit, and
debt may be dependent on the customer's total income, while the house price,
utility bills, and neighbourhood are dependent on the postal code. But those
tree stumps may not be totally independent, so some simple dependency relations
are allowed among tree stumps. That is, we search for a special class of tree
structure as the topology of our Bayesian network.

Our algorithm constructs simple tree stumps first, and then constructs links
among the tree stumps. Similar to SuperParent (Keogh and Pazzani, 1999),
search in StumpNetwork is guided by predictive accuracy. That is, our sole
criterion for constructing tree stumps and links between tree stumps is to
improve the predictive accuracy of the Augmented Naive Bayesian classifier.

¹ The time complexity is also linear in the size of the training set. But this
term is the same for all Augmented Naive Bayesian classifiers in our
discussion, so we omit it in our paper.

² Holte (1993) studied tree stumps as decision tree classifiers.

2.1 StumpNetwork

We first define the kind of tree structure that forms our hypothesis space, and
then describe greedy search strategies for finding such tree structures.

Definition 1: A tree T(r, N), where r is an attribute and N is a set of
attributes, is called a tree stump if the tree has root r and height 1, and
there is an arc from r to each node in N.

Definition 2: A set of tree stumps is called a Stump Network if the
intersection of any two tree stumps is empty, and the tree stumps may be
connected in such a way that the root of one tree stump is pointed to by at
most one leaf node of another tree stump.

Figure 1 shows an example of a Stump Network. Clearly Stump Network is 
a special form of the tree topology where attribute dependencies tend to form 
clusters. 





Fig. 1. An example of a Stump Network, a special class of TAN. Note that classification 
node C and all links from C to all Ai for all i are omitted for simplicity. 



Our algorithm for heuristically searching for a Stump Network consists of two
main steps. In the first step, it finds a set of tree stumps such that adding
them to the Naive Bayes improves the predictive accuracy on a testing set. The
resulting tree stumps are sorted by their improvement of the predictive
accuracy. The second step goes through the sorted list of tree stumps once, to
see whether each tree stump can be pointed to by a leaf of the previous tree
stumps in the sorted list in a way that improves the predictive accuracy.
If so, the link is kept in the Stump Network. The algorithm, called
StumpNetwork, is presented below.

1. Read the data D

2. Initialize B to be Naive Bayes and calculate its predictive accuracy

3. Let node set N be all attributes (except C) and let the TreeStump queue be
empty

4. Make a tree stump for each node in N. Let $T_s$ be the tree stump with the
highest improvement on the predictive accuracy

5. For each arc on $T_s$, if the predictive accuracy does not decrease after
deleting the arc, remove it from $T_s$

6. Put $T_s$ in the TreeStump queue

7. Remove all nodes of $T_s$ from N. If N is not empty, go to 4

8. Go over the TreeStump queue once: for each tree stump in the queue, add a
link from a leaf of the previous tree stumps in the queue to the root of this
tree stump, if the predictive accuracy increases
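
The eight steps can be sketched as follows. This is an illustrative reading of
the listing, not the authors' code. Several points are assumptions: the
callback `accuracy`, which maps a set of attribute-to-attribute arcs to a
predictive accuracy, stands in for training and testing an Augmented Naive
Bayes; the search stops when no stump improves the accuracy, a case the listing
does not spell out; the queue is kept in creation order; and Step 8 tries every
earlier leaf and keeps the best one.

```python
def stump_network(attributes, accuracy):
    """Greedy Stump Network search (illustrative sketch).

    attributes: attribute names, class attribute excluded.
    accuracy:   hypothetical callback; takes {child: parent} arcs among
                attributes and returns a predictive-accuracy score.
    Returns (arcs, stumps), with stumps a list of (root, children).
    """
    arcs = {}                       # Step 2: no arcs means plain Naive Bayes
    best = accuracy(arcs)
    remaining = list(attributes)    # Step 3
    stumps = []

    while remaining:
        # Step 4: try every remaining node as the root of a full stump.
        scored = []
        for root in remaining:
            trial = dict(arcs)
            trial.update({c: root for c in remaining if c != root})
            scored.append((accuracy(trial), root))
        score, root = max(scored, key=lambda pair: pair[0])
        if score <= best:           # assumption: stop when no stump helps
            break
        children = [c for c in remaining if c != root]
        current = dict(arcs)
        current.update({c: root for c in children})

        # Step 5: delete arcs whose removal does not lower the accuracy.
        for child in list(children):
            trial = dict(current)
            del trial[child]
            trial_score = accuracy(trial)
            if trial_score >= score:
                current, score = trial, trial_score
                children.remove(child)

        # Steps 6-7: queue the stump and remove its nodes from N.
        arcs, best = current, score
        stumps.append((root, children))
        used = {root, *children}
        remaining = [a for a in remaining if a not in used]

    # Step 8: one pass over the queue; link a root to an earlier leaf
    # only if the accuracy increases (the best such leaf wins).
    for index in range(1, len(stumps)):
        root = stumps[index][0]
        for leaf in [l for _, kids in stumps[:index] for l in kids]:
            trial = dict(arcs)
            trial[root] = leaf
            trial_score = accuracy(trial)
            if trial_score > best:
                arcs, best = trial, trial_score
    return arcs, stumps


# Toy scoring function (hypothetical): useful arcs add points, others cost one.
bonus = {("b", "a"): 5, ("d", "c"): 5, ("c", "b"): 3}


def toy_accuracy(arcs):
    return 70 + sum(bonus.get(arc, -1) for arc in arcs.items())


arcs, stumps = stump_network(["a", "b", "c", "d"], toy_accuracy)
print(stumps, arcs)
```

With a real classifier, `accuracy` would retrain the network with the given
arcs and measure accuracy on held-out data, which is where the cost counted in
Section 2.2 arises.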

Similar to SuperParent, our algorithm finds one super parent that links to
all other possible attributes with the highest improvement on the predictive
accuracy (Step 4). However, unlike SuperParent, which keeps just one child of
the parent node found in the previous step (and therefore needs to loop n
times), we keep all the child nodes as long as they do not result in a
decreased predictive accuracy (Step 5). The resulting one-level tree forms a
tree stump. Tree stumps may be linked together in Step 8.

2.2 Theoretical Comparison of StumpNetwork to SuperParent 

From the description of the StumpNetwork algorithm above, the time for the
creation of the tree stumps (Steps 4 to 7) is $O(n^2)$ and the time for Step 8
is $O(n)$, where n is the number of attributes. Therefore, we take the time
complexity of constructing tree stumps as the time complexity of StumpNetwork,
$T_c$. We have:

$$T_c(n) = 2n + 2(n - m_1) + 2(n - m_1 - m_2) + \cdots + 2\Bigl(n - \sum_{i=1}^{N-1} m_i\Bigr) \qquad (1)$$



where N is the number of tree stumps, and $m_i$ is the size of tree stump i.
Let $k = \frac{1}{N}\sum_{i=1}^{N} m_i$ be the average size of the tree stumps.
We have:

$$T_c(n) = \frac{n(n + k)}{k} \qquad (2)$$
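
As a check on Equation 2, consider the idealized case (an assumption made here
for illustration, not stated in the text) in which all N tree stumps have the
same size k, so that $m_i = k$ and $N = n/k$. Equation 1 then sums to:

```latex
\begin{aligned}
T_c(n) &= 2\sum_{j=0}^{N-1} (n - jk) = 2Nn - kN(N-1) \\
       &= \frac{2n^2}{k} - n\Bigl(\frac{n}{k} - 1\Bigr)
        = \frac{n^2}{k} + n = \frac{n(n+k)}{k}
\end{aligned}
```

When the stump sizes differ, k is only an average and the closed form should be
read as an approximation.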



Because the tree stump found in Step 4 improves the predictive accuracy, and
deleting edges in Step 5 does not decrease the predictive accuracy, each
resulting tree stump must have size greater than 1. That is, $k \ge 2$.

The time complexity of SuperParent, $T_s$, is easily obtained as (Keogh and
Pazzani 1999):

$$T_s(n) = 2n + 2(n - 1) + \cdots + 2 = n(n + 1) \qquad (3)$$



Omitting the $O(n)$ terms in Equation 2 and Equation 3, we have:

$$\frac{T_c(n)}{T_s(n)} \approx \frac{1}{k} \qquad (4)$$

Since $k \ge 2$, $T_c(n)$ is at most about $\frac{1}{2} T_s(n)$. The greater k
is, the smaller $T_c(n)$ is relative to $T_s(n)$.
In many data mining applications, the number of attributes is often quite
large (hundreds to thousands). In addition, those attributes can be clustered
into several large groups. For these problems, our algorithm will be many times
more efficient than SuperParent. The greater the number of attributes in
each group, the more efficient our algorithm will be.
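
To make the comparison concrete, the cost formulas can be evaluated for a
specific configuration. The sketch below simply evaluates Equation 1 and
Equation 3 as reconstructed from the surrounding text; the function names and
the example of 20 attributes in four stumps of size 5 are illustrative
assumptions.

```python
def stump_network_cost(n, stump_sizes):
    """Equation 1: 2n + 2(n - m_1) + ... + 2(n - m_1 - ... - m_{N-1})."""
    if sum(stump_sizes) != n:
        raise ValueError("stump sizes must add up to the number of attributes")
    cost, covered = 0, 0
    for size in stump_sizes:
        cost += 2 * (n - covered)   # one round of Step 4 on the remaining nodes
        covered += size             # Step 7 removes the stump's nodes
    return cost


def super_parent_cost(n):
    """Equation 3: 2n + 2(n - 1) + ... + 2 = n(n + 1)."""
    return n * (n + 1)


n, sizes = 20, [5, 5, 5, 5]
k = n / len(sizes)                  # average stump size
t_c = stump_network_cost(n, sizes)
t_s = super_parent_cost(n)
print(t_c, n * (n + k) / k, t_s, t_c / t_s)
```

For equal-sized stumps the value from Equation 1 coincides with the closed form
of Equation 2, and the ratio approaches 1/k as n grows.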

The empirical results in the next subsection confirm this conclusion.
Moreover, little is lost by using a simplified TAN to gain efficiency: the
predictive accuracies of the two algorithms are shown there to be very similar.

2.3 Experimental Results of StumpNetwork 

We compare StumpNetwork with SuperParent using several datasets from the 
UCI repository (Merz, 1997), and one real-world dataset from a bank that we 
worked on in a previous project. Table 1 lists the properties of the datasets we 
used in our experiment. 



Table 1. Descriptions of domains used in our experiments. 



Dataset     Attributes  Classes  Instances
Ecoli                7        8        336
Vote                16        2        435
Pima                 8        2        768
Australia           14        2        690
Breast              10        2        683
Segment             19        7       1540
Vehicle             18        4        846
Bank                20        2       1162



Our experiments follow the procedure below: 

1. Discretize the continuous attributes in the dataset with the entropy-based
method of Fayyad (1993).

2. Calculate the average predictive accuracy of SuperParent and StumpNetwork,
respectively, with 5-fold cross-validation.

Table 2 shows the experimental results. As we can see, StumpNetwork
achieves essentially the same testing accuracy as SuperParent. On the other
hand, the time used in constructing the Augmented Naive Bayes is quite
different. Averaged over the datasets we compared, StumpNetwork is 4.8 times
faster than SuperParent. There is a clear trend of larger savings with larger
datasets and with datasets that have a larger number of attributes.
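
The speed-up figure can be checked against the timings in Table 2. The short
script below copies those timings and computes both the ratio of the average
times and the per-dataset ratios; the variable names are illustrative, and the
time unit is whatever Table 2 uses, since the text does not state it.

```python
# Construction times from Table 2: (StumpNetwork, SuperParent).
times = {
    "Ecoli": (4.08, 7.74),
    "Vote": (16.33, 39.63),
    "Pima": (2.75, 6.36),
    "Australia": (17.97, 53.59),
    "Breast": (37.17, 97.61),
    "Segment": (1675.80, 8275.60),
    "Vehicle": (152.39, 668.55),
    "Bank": (348.37, 1726.20),
}

mean_stump = sum(t[0] for t in times.values()) / len(times)
mean_super = sum(t[1] for t in times.values()) / len(times)
ratio_of_means = mean_super / mean_stump

# Per-dataset speed-up of StumpNetwork over SuperParent.
speedups = {name: sp / st for name, (st, sp) in times.items()}

print(round(ratio_of_means, 1))
for name, s in sorted(speedups.items(), key=lambda item: item[1]):
    print(f"{name:10s} {s:.2f}")
```

The 4.8 figure corresponds to the ratio of the average times, which is
dominated by the slowest datasets; the per-dataset ratios are smaller on the
small datasets and largest on Segment, Vehicle, and Bank.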

Table 2. Comparison of our Stump Network and Super Parent.

            StumpNetwork             SuperParent
Dataset     Accuracy       Time      Accuracy       Time
Ecoli       84.43±1.07        4.08   83.11±1.16        7.74
Vote        96.07±1.99       16.33   94.02±2.23       39.63
Pima        78.98±2.28        2.75   79.39±1.83        6.36
Australia   85.43±1.83       17.97   86.07±1.47       53.59
Breast      96.38±0.70       37.17   96.56±0.91       97.61
Segment     93.37±0.96     1675.80   94.55±0.78     8275.60
Vehicle     69.10±1.86      152.39   68.91±1.13      668.55
Bank        53.22±1.62      348.37   53.58±1.94     1726.20
Average     81.87           281.9    82.02          1359.4

3 Conclusions

In this paper we present a new algorithm for constructing a special kind of
tree augmented Naive Bayes based on work by Keogh and Pazzani (1999). Our
algorithm works best in domains with a large number of attributes whose
attributes tend to form large clusters with simple dependency relations inside
clusters. Both experimental and theoretical analyses show that our algorithm
is faster than Keogh and Pazzani's (1999) algorithm by a constant factor, and
that the constant is proportional to the size of the clusters. Our algorithm
also produces a simpler tree structure than an arbitrary tree, and thus more
comprehensible results. Empirical comparisons demonstrate that both algorithms
have very similar predictive accuracies.

In our future research, we will study other efficient and specialized
Augmented Naive Bayes classifiers suitable for domains possessing certain
properties commonly occurring in real-world applications.

Abstract. The application of neural networks to domains involving prediction
and classification of symbolic data requires a reconsideration and a careful
definition of the concept of distance between patterns. Traditional distances
are inadequate to assess the differences between symbolic patterns. This work
proposes the utilization of a statistically
extracted distance measure in the context of Generalized Radial Basis 
Function (GRBF) networks. The main properties of the GRBF networks 
are retained in the new metric space. The regularization potential of 
these networks can be realized with this type of distance. Furthermore, 
the recent engineering of neural networks offers effective solutions for 
learning smooth functionals that lie on high dimensional spaces. 



1 Introduction 

The emergence of neural network (NN) technology [?] offers valuable solutions
to complicated data mining problems. Patterns arising both from commercial
databases and from many engineering databases (such as those that describe
biosequences) involve data defined over a space that lacks the fundamental
properties of distance metric spaces. This work constructs a proper distance
metric for expressing the distance between values of features in symbolic
domains. This metric has geometric properties that make it effective in the
context of the regularization formulation of Generalized Radial Basis Function
(GRBF) networks. Regularization techniques force the network to learn a smooth
functional [?]. Therefore, it is justifiable to expect the network to be able
to learn the underlying smooth dependence of the outcomes on the attributes,
even in the presence of noise that perturbs it. The potential of this distance
metric to regularize the solution of the GRBF networks is the theoretical
justification for the improved performance relative to simple nearest neighbor
schemes. The paper proceeds as follows: Section 2 presents the proposed
Statistical Distance Metric (SDM). Section 3 discusses how the statistical
distance fits in the context of GRBF networks. Section 4 discusses the
heuristic instance-based parsing of the training set used to improve the GRBF
parameters (i.e. the selection of centers and their spreads). Section 5
describes applications, and the last section presents the conclusions of the
present work.

2 The Statistical Distance Metric (SDM) 



The key problem for applications involving symbolic features is the definition
of the distance metric. In domains where features are numeric, it is
straightforward to compute the distance between two points in the pattern space
in terms of a geometric distance. Indeed, the traditional RBF learning
algorithms have been formulated for, and operate effectively in, numeric
domains with such distances. However, when the features are symbolic (as is
usual in data mining applications using databases from bioinformatics or
databases characteristic of a certain type of disease), the traditional types
of distances yield inadequate performance. There are two common approaches for
handling symbolic information, the overlap method and the orthogonal
representation [5], [?], both of which perform poorly on symbolic data. In
order to obtain an effective formulation of the distances between patterns
with symbolic feature values, we have adapted the distance measure proposed in
[?]. This statistical distance measure takes into account the overall
similarity of classification of all instances for each possible value of each
feature. Using a statistical approach, the method extracts from the training
set a matrix that defines the distances between all possible values of a given
feature. A separate matrix is therefore obtained for each feature. The distance
measure for a specific feature is defined according to the following equation:



$$d(V_A, V_B) = \sum_{i=1}^{N} \left| \frac{C_{A_i}}{C_A} - \frac{C_{B_i}}{C_B} \right|^{k} \qquad (1)$$



In the equation above, $V_A$ and $V_B$ denote two possible values for the
feature, e.g. for the DNA promoter data they will be two nucleotides. The
distance between the values is a sum over all N classes. For example, for the
DNA promoter example (discussed below) there are two classes: either the
sequence is a promoter (i.e. a sequence that initiates a process called
transcription) or it is not. The number of patterns with value $V_A$ ($V_B$)
that are classified to class i is denoted by $C_{A_i}$ ($C_{B_i}$). The total
number of patterns with value $V_A$ ($V_B$) is denoted by $C_A$ ($C_B$), and k
is a constant usually set to 1. These counts are computed over all patterns of
the training set. It is easy to see that the more correlated the
classifications of patterns pertaining to two values of a feature are, the
smaller their statistical distance computed with equation (1). Therefore,
feature values belonging to training set patterns with similar classifications
are assigned a small statistical distance. The distance between two patterns is
obtained as a weighted sum of distances between the values of the individual
features of these patterns:

$$D(X, Y) = \sum_{i=1}^{F} W_{f_i}\, d(V_{x_i}, V_{y_i})^{r} \qquad (2)$$

where F is the number of features, $W_{f_i}$ is the weight assigned to
feature $f_i$, reflecting its significance, and r is a parameter that controls
how distances between individual features scale in the computation of the total
pattern distance (usually r = 1 or 2). Also, $V_{x_i}$ and $V_{y_i}$ denote the
values of the ith feature of X and Y.
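
The construction of the per-feature distance matrices and of the pattern
distance can be sketched as follows. This is a minimal illustration of
equations (1) and (2) under the reading given above, in which $C_A$ counts the
training patterns having value $V_A$; the function names, the dictionary-based
tables, and the default of unit feature weights are assumptions of this sketch.

```python
from collections import Counter, defaultdict


def value_distance_tables(patterns, classes, k=1):
    """For each feature, build the table d(V_A, V_B) of equation (1)."""
    class_set = sorted(set(classes))
    tables = []
    for f in range(len(patterns[0])):
        # per_value[v][c]: number of patterns with value v in class c (C_{A_i})
        per_value = defaultdict(Counter)
        for pattern, c in zip(patterns, classes):
            per_value[pattern[f]][c] += 1
        table = {}
        for va, counts_a in per_value.items():
            total_a = sum(counts_a.values())              # C_A
            for vb, counts_b in per_value.items():
                total_b = sum(counts_b.values())          # C_B
                table[(va, vb)] = sum(
                    abs(counts_a[c] / total_a - counts_b[c] / total_b) ** k
                    for c in class_set
                )
        tables.append(table)
    return tables


def pattern_distance(tables, x, y, weights=None, r=1):
    """Equation (2): weighted sum of per-feature value distances."""
    if weights is None:
        weights = [1.0] * len(tables)     # assumption: equal feature weights
    return sum(
        w * table[(xv, yv)] ** r
        for w, table, xv, yv in zip(weights, tables, x, y)
    )


# Toy single-feature data set with nucleotide-like values.
patterns = [("a",), ("a",), ("c",), ("c",), ("g",), ("g",)]
classes = ["+", "+", "-", "-", "+", "-"]
tables = value_distance_tables(patterns, classes)
print(tables[0][("a", "c")], tables[0][("a", "g")])
```

Values that never occur in the training set have no entry in the tables, so a
practical implementation would need a policy for unseen values.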



3 Generalized RBFs with the Statistical Distance Metric 
(SDM) 



The Generalized Radial Basis Function networks exploit Tikhonov's
regularization theory to obtain good generalization performance, as described
in [?]. One prerequisite for applying the SDM is to have enough training data
for the accurate construction of the SDM space. However, training sets large
enough to provide the essential information for generalization also provide
the information necessary for computing an effective distance matrix. In
contrast to example-based nearest neighbor learning schemes, the GRBF learns a
smooth functional that weights the contribution of each example subject to the
requirements imposed by the regularizing term for the smoothness of the
solution. This fact is the theoretical explanation for the superior
performance of GRBF networks relative to Instance Based Learning (IBL) schemes.
A parameter of particular importance is the region of influence of the GRBF
kernels, which is determined by their spread parameter σ. This problem becomes
more complicated within the domain of statistical distances, and the heuristic
suggestion of [?] to compute σ as

$$\sigma = \frac{d_{max}}{\sqrt{2m}} \qquad (3)$$

where $d_{max}$ is the maximum distance and m the number of RBF centers, has
not proved effective in practice. In order to obtain an effective setting for
the spread parameter, a sensible approach is first to estimate the average
distance $d_{av}$ of patterns within the space defined by the SDM. Then the
region of influence of the RBF kernels is designed by requiring that at a
particular distance Spread from the RBF center, expressed in units of
$d_{av}$, the influence is attenuated by a factor a. Mathematically, this
requirement is formulated as $\exp(-DF \cdot d_{av} \cdot Spread) = a$, and
therefore the required parameter DF is derived as:



$$DF = \frac{-\log(a)}{Spread \cdot d_{av}} \qquad (4)$$



Values of these parameters that give good results are, for example, Spread = 5
and a = 0.01, meaning that at a distance from an RBF center 5 times larger
than the average distance between patterns, the influence of the RBF function
is attenuated by a factor of 0.01. The influence of an RBF center at a distance
x from the center is $\exp(-DF \cdot x)$. However, since the above scheme sets
the spreads of the RBF centers globally, the peculiarities and irregularities
of the state space are ignored. An additional instance-based learning step,
described in the following section, can estimate the relative importance of
each RBF center and can therefore improve the performance of the designed RBF
solution.
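
The decay parameter and the resulting influence function are easy to check
numerically. The sketch below evaluates equation (4) for the suggested values
Spread = 5 and a = 0.01; the average distance d_av = 2.0 is an arbitrary value
chosen for illustration, and the function names are hypothetical.

```python
import math


def decay_factor(spread, a, d_av):
    """Equation (4): DF such that exp(-DF * d_av * spread) equals a."""
    return -math.log(a) / (spread * d_av)


def influence(df, x):
    """Influence of an RBF center at distance x from the center."""
    return math.exp(-df * x)


d_av = 2.0                            # illustrative average SDM distance
df = decay_factor(spread=5, a=0.01, d_av=d_av)
print(influence(df, 0.0))             # full influence at the center
print(influence(df, 5 * d_av))        # attenuated to about a = 0.01
```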



4 Instance-Based Learning for the Determination of the 
Parameters of the RBF Networks 



It is highly desirable to use the reliable examples as centers of the RBF
network. Also, the more reliable an example is, the larger its region of
influence should be when the example is used as an RBF center. The extent of
the region of influence is expressed by the spread parameter σ of the RBF
center. A heuristically driven learning strategy is adopted for determining
the examples that should be used as RBF centers and their widths. The proposed
GRBF training approach consists of two steps. In the first step, the Instance
Based Learning (IBL) step, successive learning steps evaluate the potential of
each example for serving as an RBF center, i.e. how representative the example
is. This step is heuristic, and it tries to discover the reliability and the
importance of the training examples with an instance-based learning scheme
that resembles the functionality of PEBLS [?]. This solution can be implemented
with nearest neighbor schemes, and if it is viewed as an input-output mapping
it tends to create many class boundaries and discontinuous "islands" of
misclassified regions placed near erroneously classified examples. The
structure of the decision boundaries is smoothed, and most of the regions with
artifacts are removed, in order to reject the influence of noisy examples on
the designed classification system. These examples do not yield satisfactory
performance in the initial IBL step, so they are not selected as RBF centers.
The second learning step constructs the Green's matrix with the spreads of the
Gaussian kernels estimated in the heuristically driven first step. During the
first step, an empirical approximation to the solution is constructed. There
are three basic approaches that can be used in the first heuristic learning
pass.

1) The one pass approach is an exemplar weighting method that is used in
conjunction with the nearest neighbor parameter k. Learning is accomplished
with only one pass through the training examples. In this training step, for
each training instance its k nearest neighbors are found among the remaining
training set. If j neighbors have a matching class, then the weight assigned
to the current instance is given by the simple formula weight = 1 + k − j.
Therefore, the more the class of the exemplar is reinforced by its neighbors,
the smaller the weight (i.e. the more reliable the exemplar is).
Algorithmically, the one pass instance-based learning algorithm takes the form:

for each pattern P of the training set do begin

   1 detect the k nearest neighbors to P from the training set
     according to the SDM;

   2 let j = number of nearest neighbors with the same class label
     as the class of P;

   3 set the weight parameter that quantifies the reliability
     of the exemplar as weight = 1 + k - j

end;
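
A runnable version of this loop might look as follows. It is a sketch under
stated assumptions: the distance is passed in as a function (in the paper it
would be the SDM pattern distance), the name `one_pass_weights` is
hypothetical, and ties between equally distant neighbors are broken by
training-set order.

```python
def one_pass_weights(patterns, classes, distance, k=3):
    """Assign each training pattern the weight 1 + k - j.

    j is the number of its k nearest neighbors (among the other training
    patterns) that share its class, so reliable exemplars get small weights.
    """
    weights = []
    for p_index, pattern in enumerate(patterns):
        # Step 1: the k nearest neighbors of P among the remaining patterns.
        others = [i for i in range(len(patterns)) if i != p_index]
        others.sort(key=lambda i: distance(pattern, patterns[i]))
        neighbours = others[:k]
        # Step 2: count neighbors with the same class label as P.
        j = sum(1 for i in neighbours if classes[i] == classes[p_index])
        # Step 3: reliability weight of the exemplar.
        weights.append(1 + k - j)
    return weights


# Toy example with a numeric stand-in for the SDM distance.
patterns = [0, 1, 2, 10, 11, 12]
classes = ["A", "A", "A", "B", "B", "A"]
weights = one_pass_weights(patterns, classes, lambda x, y: abs(x - y), k=2)
print(weights)
```

In the toy data the last pattern is surrounded by examples of the other class,
so it receives the largest weight and would be the least attractive RBF center.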

The other two approaches that we tested for the weighting were 2) the
used-correct method and 3) the increment method.

5 Applications 

We have applied the GRBF-based solutions to a variety of data mining problems,
both from the engineering domain and from the commercial database domain.
Below we briefly describe one application from bioinformatics, one from data
mining of commercial databases, and finally some examples using databases from
the UCI Machine Learning Repository, only from the medical field
(http://www.ics.uci.edu/~mlearn/MLRepository.html). The first application
concerns the prediction of promoter sequences [?]. This task involves
predicting whether or not a given subsequence of a DNA sequence is a promoter,
i.e. a sequence that initiates a process called transcription. The data set
contains 106 examples, 53 of which are positive examples (promoters) and the
rest negative ones. A training pattern consists of a sequence of 57 nucleotides
(features) from the alphabet a, c, g and t, with the respective classification
(promoter or not promoter). Since the number of available patterns is small,
the classification performance was tested with the leave-one-out methodology,
i.e. repeated trials were performed by training on 105 examples and testing
on the remaining one. The resulting performance was 2/106 (i.e. 2 errors over
106 trials) versus 4/106 for a competing experiment that used the KBANN neural
network model [?].
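
The leave-one-out protocol used for the promoter data can be written
generically. The sketch below is an illustration only: `train` and `predict`
are hypothetical callbacks standing in for the GRBF training and prediction
routines, and the majority-class learner in the usage example exists just to
make the snippet runnable.

```python
from collections import Counter


def leave_one_out_errors(patterns, classes, train, predict):
    """Train on all examples but one, test on the held-out one, and repeat."""
    errors = 0
    for i in range(len(patterns)):
        train_x = patterns[:i] + patterns[i + 1:]
        train_y = classes[:i] + classes[i + 1:]
        model = train(train_x, train_y)
        if predict(model, patterns[i]) != classes[i]:
            errors += 1
    return errors


# Stand-in learner: always predicts the majority class of its training set.
def train_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, pattern):
    return model


patterns = [["a", "c"], ["a", "g"], ["t", "c"], ["g", "g"]]
classes = ["promoter", "promoter", "promoter", "non-promoter"]
errors = leave_one_out_errors(patterns, classes, train_majority, predict_majority)
print(f"{errors}/{len(patterns)} errors")
```

With 106 examples the loop runs 106 times, and the reported figure 2/106 is the
count of held-out examples that were misclassified.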

Table 1 shows that the use of IBL within the framework of GRBFs improves on
the generalization performance obtained with the classic PEBLS algorithm.
However, we cannot easily conclude that a particular IBL learning approach
(among those described in the previous section) is better.

Table 1. Performances of the proposed GRBF + IBL data mining algorithm

                                    GRBF+IBL   GRBF+IBL       GRBF+IBL
Database          PEBLS    GRBF     One-pass   Used-correct   Increment
Hypothyroid       97.90    98.04    98.33      98.34          98.29
Breast cancer     94.23    95.8     96.01      96.12          96.08
Iris              94.62    95.2     96.22      96.2           95.09
Hepatitis         76.59    78.23    79.45      84.31          81.29
Liver Disorders   63.45    62.98    65.9       72.5           74.56
Heart Disease     81.90    82.34    82.28      83.20          85
Audiology         77.90    78.91    81.06      81.03          79.44

6 Conclusions

Neural network algorithms for learning are very effective in domains in which
all features have numeric values. In these domains, the examples are treated as
points and distance metrics obey standard definitions. However, the usual
domain of data mining applications is the symbolic domain. The use of
traditional distance metrics for data mining with neural networks usually
gives modest results. This paper has adapted an SDM for application in the
context of GRBF neural networks. This distance metric extends the area
of effectiveness of GRBF neural networks to the symbolic world. The results
indicate that the generalization potential of neural networks can be utilised
for patterns with symbolic features when the learning and evaluation algorithms
are designed with the statistically extracted distance metric.

Future work to further improve the proposed GRBF and IBL hybrid data mining
algorithms can proceed along many different directions, such as finding optimal
multisplits (for numerical attributes) or using a simulated annealing algorithm
(for the discretization of continuous attributes).